
Adaptive estimation of the optimal ROC curve and a bipartite ranking algorithm

Stéphan Clémençon
Telecom Paristech (TSI), LTCI UMR Institut Telecom/CNRS 5141
[email protected]

Nicolas Vayatis
ENS Cachan & UniverSud, CMLA UMR CNRS 8536
[email protected]

Abstract

In this paper, we propose an adaptive algorithm for bipartite ranking and prove its statistical performance in a sense stronger than the AUC criterion. Our procedure builds on the RankOver algorithm proposed in (Clémençon & Vayatis, 2008a). The algorithm outputs a piecewise constant scoring rule obtained by overlaying a finite collection of classifiers. Here, each of these classifiers is the empirical solution of a specific minimum-volume set (MV-set) estimation problem. The main novelty arises from the fact that the levels of the MV-sets to recover are chosen adaptively from the data to adjust to the variability of the target curve. The ROC curve of the estimated scoring rule may be interpreted as an adaptive spline approximant of the optimal ROC curve. Error bounds for the estimate of the optimal ROC curve in terms of the L∞-distance are also provided.

1 Introduction

For several decades, ROC curves have been widely used as the gold standard for assessing performance in areas such as signal detection, medical diagnosis, and credit-risk screening. More recently, ROC analysis has become an area of growing interest in Machine Learning. Various aspects are considered in this new approach, such as model evaluation, model selection, machine learning metrics for evaluating performance, model construction, multiclass ROC, geometry of the ROC space, confidence bands for ROC curves, improving performance of classifiers, the connection between classifiers and rankers, and model manipulation (see for instance (Flach, 2004) and references therein). We focus here on the problem of bipartite ranking and the issue of ROC curve optimization. Previous work on bipartite ranking ((Freund et al., 2003), (Agarwal et al., 2005), (Clémençon et al., 2008)) considered the AUC criterion as the optimization target. However, this criterion is known to weight errors uniformly, and ranking rules with similar AUC may behave very differently on subsets of the input space.

In this paper, we focus on two problems: (i) the estimation of the optimal curve ROC∗, and (ii) the construction of a consistent scoring rule whose ROC curve converges in supremum norm to ROC∗. In contrast to binary classification or AUC maximization, the classical empirical risk minimization approach cannot be invoked here because of the function-like nature of the performance measure and the use of the supremum norm as a metric. The approach taken here follows the perspective sketched in (Clémençon & Vayatis, 2008a), and further explored in (Clémençon & Vayatis, 2008b). In these two papers, ranking rules made of overlaid classifiers were considered and the RankOver algorithm was introduced. Dealing with a function-like optimization criterion such as the ROC curve requires performing both curve approximation and statistical estimation. In the RankOver algorithm, the approximation step is conducted with a piecewise linear approximation with fixed breakpoints on the false positive rate axis. The estimation part involves a collection of classification problems with a mass constraint. In (Clémençon & Vayatis, 2008b), we improved this step by using a modified minimum-volume set approach inspired by (Scott & Nowak, 2006) to solve this collection of constrained classification problems. More precisely, our method can be understood as a statistical version of a simple finite element method with an explicit scheme: it produces an accurate spline estimate of the optimal curve in the ROC space, together with a scoring rule whose ROC curve mimics the behavior of the optimal one. In our previous work (Clémençon & Vayatis, 2008a; Clémençon & Vayatis, 2008b), bounds on the generalization rate of this ranking algorithm were obtained under strong conditions on the regularity of the optimal ROC curve. Indeed, it was assumed that the optimal ROC curve was twice continuously differentiable and that its derivative was bounded in the neighborhood of the origin. The purpose of this paper is to relax these regularity conditions. In particular, we provide an adaptive algorithm which selects the breakpoints for the approximation of the ROC curve by means of a data-driven scheme which takes into account the variability of the target curve. Hence, the partition of the false positive rate axis is chosen according to the local regularity of the optimal curve.

The paper is structured as follows. In Section 2, notations are set out and important concepts of ROC

analysis are briefly described. Section 3 is devoted to the presentation of the adaptive approximation of the optimal ROC curve with dyadic recursive partitioning. In Section 4, theoretical results related to empirical minimum-volume set (MV-set) estimation are recalled. The adaptive statistical method for estimating the optimal ROC curve and the related ranking algorithm are presented in Sections 5 and 6, respectively, together with the main results of the paper. Proofs are postponed to the Appendix.

2 Setup

2.1 Probabilistic model

The probabilistic setup is the same as in standard binary classification. Here and throughout, (X, Y) denotes a pair of random variables where Y ∈ {−1, +1} is a binary label and X models some observation for predicting Y, taking its values in a high-dimensional feature space X ⊂ Rd. The joint distribution of (X, Y) is entirely determined by the pair (µ, η), where µ denotes the marginal distribution of X and η(x) = P{Y = +1 | X = x}, x ∈ X, the regression function. We also introduce the theoretical proportion p = P{Y = +1}, as well as G and H, the conditional distributions of X given Y = +1 and Y = −1 respectively. Throughout the paper, these probability measures are assumed to be absolutely continuous with respect to each other. Equipped with these notations, one may write η(x) = p(dG/dH)(x)/(1 − p + p(dG/dH)(x)) and µ = pG + (1 − p)H.

2.2 Bipartite ranking and ROC curves

We briefly recall the bipartite ranking task and describe the key notions related to this statistical learning problem. Based on the observation of i.i.d. examples Dn = {(Xi, Yi) : 1 ≤ i ≤ n}, the goal is to learn how to order all instances x ∈ X in such a way that the instances with label Y = +1 appear at the top of the list with the largest possible probability. Clearly, the simplest way of defining an order relationship on X is to transport the natural order on the real line to the feature space through a scoring rule s : X → R. The notion of ROC curve, which we recall below, provides a functional criterion for evaluating the performance of the ordering induced by such a function. We denote by F⁻¹(t) = inf{u ∈ R : F(u) ≥ t} the pseudo-inverse of any càd-làg increasing function F : R → R and by S the set of all scoring functions, i.e. the space of real-valued measurable functions on X.

Definition 1 (ROC curve) Let s ∈ S. The ROC curve of the scoring function s(x) is the càd-làg curve given by

α ∈ [0, 1] ↦ ROC(s, α) = 1 − Gs ∘ Hs⁻¹(1 − α),

where Gs and Hs denote the conditional distributions of s(X) given Y = +1 and given Y = −1, respectively. We denote by ROC∗ the optimal ROC curve, obtained for s = η.

When Gs(du) and Hs(du) are both continuous distributions, the ROC curve of s(x) is nothing else than

the PP-plot:

t ↦ ( P{s(X) ≥ t | Y = −1}, P{s(X) ≥ t | Y = +1} ).   (1)

It is a well-known result in ROC analysis that increasing transforms of the regression function η(x) form the class S∗ of optimal scoring functions, in the sense that their common ROC curve, namely ROC∗ = ROC(η, ·), dominates the ROC curve of any other scoring function s(x) everywhere: ∀α ∈ [0, 1[, ROC(s, α) ≤ ROC∗(α). The proof of this fact is based on a simple application of the Neyman-Pearson lemma in hypothesis testing: the likelihood statistic Φ(X) = (1 − p)η(X)/(p(1 − η(X))) yields a uniformly most powerful statistical test for discriminating between the composite hypotheses H0 : Y = −1 and H1 : Y = +1 (i.e. H0 : X ∼ H and H1 : X ∼ G). Therefore, the power of any other test s(X) is smaller than that of the test based on η(X) at the same level α. It is also noteworthy that, when continuous, the curve ROC∗ is concave. One may refer to (Clémençon & Vayatis, 2008c) for a detailed list of properties of ROC curves.

Remark 1 (Alternative convention.) Note that ROC curves may alternatively be defined through the representation given in formula (1). With this convention, jumps in the graph, due to possible degeneracy points of H∗ and G∗, are continuously connected by line segments; see (Clémençon & Vayatis, 2008d) for instance.

Hence, a good scoring function is such that, for any level α ∈ (0, 1), the power ROC(s, α) of the test it defines is close to the optimal power ROC∗(α). The sup norm

||ROC∗ − ROC(s, ·)||∞ = sup_{α ∈ (0,1)} {ROC∗(α) − ROC(s, α)}

provides a natural way of measuring the performance of a scoring rule s(x). The ROC curve of the scoring function s(x) can be straightforwardly estimated from the training dataset Dn by the stepwise function

α ∈ (0, 1) ↦ ROĈ(s, α) = 1 − Ĝs ∘ Ĥs⁻¹(1 − α),

where

Ĥs(t) = (1/n−) Σ_{i: Yi = −1} I{s(Xi) ≤ t},   Ĝs(t) = (1/n+) Σ_{i: Yi = +1} I{s(Xi) ≤ t},

with n+ = Σ_{i=1}^{n} I{Yi = +1} = n − n−. However, the target curve ROC∗ is unknown in practice and no empirical counterpart is directly available for the deviation ||ROC∗ − ROC(s, ·)||∞.
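As a concrete illustration, here is a minimal NumPy sketch (ours, not from the paper) of the stepwise estimator above; np.quantile is used as a convenient stand-in for the empirical pseudo-inverse Ĥs⁻¹, and the function name is our own.

```python
import numpy as np

def empirical_roc(scores, labels, alphas):
    """Stepwise estimate of ROC(s, alpha): 1 - G_hat_s(H_hat_s^{-1}(1 - alpha))."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    neg = scores[labels == -1]          # scores of the negative examples (Y = -1)
    pos = scores[labels == +1]          # scores of the positive examples (Y = +1)
    roc = []
    for alpha in alphas:
        t = np.quantile(neg, 1.0 - alpha)   # ~ H_hat_s^{-1}(1 - alpha)
        roc.append(np.mean(pos > t))        # ~ 1 - G_hat_s(t): empirical true positive rate
    return np.array(roc)

# toy usage on synthetic scores that separate the two classes moderately well
rng = np.random.default_rng(0)
y = rng.choice([-1, 1], size=1000)
s = y + rng.normal(scale=1.5, size=1000)
print(empirical_roc(s, y, alphas=[0.05, 0.1, 0.25, 0.5]))
```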

For this reason, empirical risk minimization (ERM) strategies are generally based on the L1-distance, leading to the popular AUC criterion: minimizing ||ROC∗ − ROC(s, ·)||_{L1([0,1])} indeed boils down to maximizing

AUC(s) := ∫_{α=0}^{1} ROC(s, α) dα.

An empirical counterpart of the AUC may be built from the Mann-Whitney statistic; see (Clémençon et al., 2005) and the references therein. Beyond this computational advantage, it is noteworthy that two scoring functions may have the same AUC while their ROC curves present very different shapes. Since the L1-distance does not account for local properties of the ROC curve, we point out the importance of deriving strategies for ROC curve estimation and optimization whose convergence is validated in a stronger sense than the AUC. The goal of this paper is precisely to provide an adaptive procedure for estimating ROC∗ in sup norm under mild regularity conditions.

Regularity of the curve ROC∗. In the subsequent analysis, the following assumptions will be required.

A1 The conditional distributions G∗(dt) and H∗(dt) of the random variable η(X) are continuous.

A2 The density of the distribution H∗ is strictly positive on (0, 1).

We recall that under these assumptions one may give an explicit expression for the derivative of ROC∗. For any α ∈ (0, 1), we denote by Q∗(α) the quantile of order (1 − α) of the conditional distribution of η(X) given Y = −1.

Lemma 2 (Clémençon & Vayatis, 2008d) Suppose that assumptions A1-A2 are fulfilled. Let α ∈ (0, 1) such that Q∗(α) < 1. Then, ROC∗ is differentiable at α and

dROC∗/dα (α) = ((1 − p)/p) · Q∗(α)/(1 − Q∗(α)).

In (Clémençon & Vayatis, 2008b), a statistical procedure for estimating the curve ROC∗, mimicking a linear spline approximation scheme, was proposed in a very restrictive setup, stipulating that ROC∗ is of class C2 with its first two derivatives bounded. As shown by the result above, boundedness of ROC∗′ means that Q∗(0) := lim_{α→0} Q∗(α) < 1, in other words that η(X) stays bounded away from 1, or equivalently that the likelihood ratio Φ(X) = (1 − p)η(X)/(p(1 − η(X))) remains bounded. It is the purpose of this paper to examine to which extent one may estimate ROC∗ under weaker assumptions (see assumption A5 below), including cases where it has a vertical tangent at the origin.

2.3 Ranking by overlaying classifiers

From the angle embraced in this paper, ranking amounts to recovering the decreasing collection of level sets of the regression function η(x),

{ {x ∈ X | η(x) > u}, u ∈ [0, 1] },

without necessarily knowing the corresponding levels. Indeed, any scoring function of the form

s∗(x) = ∫_{0}^{1} I{η(x) > Q∗(α)} dν(α),   (2)

where ν(dα) is an arbitrary finite positive measure on [0, 1] with the same support as the distribution H∗, is optimal with respect to the ROC criterion. The next proposition also illustrates this view of the problem. We set the notations:

R∗α = {x ∈ X | η(x) > Q∗(α)},   Rs,α = {x ∈ X | s(x) > Q(s(X), α)},

where Q(s(X), α) is the quantile of order (1 − α) of the conditional distribution of s(X) given Y = −1.

Proposition 3 (Clémençon & Vayatis, 2008b) Let s be a scoring function and α ∈ (0, 1) such that Q∗(α) < 1. Suppose additionally that the cdf Hs (respectively, H∗) is continuous at Q(s(X), α) (resp. at Q∗(α)). Then, we have:

ROC∗(α) − ROC(s, α) = E( |η(X) − Q∗(α)| I{X ∈ A} ) / ( p(1 − Q∗(α)) ),

where A = A(s, α) = R∗α ∆ Rs,α and ∆ denotes the symmetric difference between sets.

This result shows that the pointwise difference between the dominating ROC curve and the one related to a candidate scoring function s may be interpreted as the error made in recovering the specific level set R∗α through Rs,α.
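For completeness, the empirical AUC mentioned in Section 2.2 can be computed from the Mann-Whitney statistic; below is a minimal sketch (ours) counting concordant positive/negative pairs, with ties weighted by 1/2.

```python
import numpy as np

def empirical_auc(scores, labels):
    """Empirical AUC via the Mann-Whitney statistic: the fraction of
    (positive, negative) pairs ranked in the right order (ties count 1/2)."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    pos = scores[labels == +1]
    neg = scores[labels == -1]
    diff = pos[:, None] - neg[None, :]            # all pairwise score differences
    return np.mean(diff > 0) + 0.5 * np.mean(diff == 0)
```

For large samples one would of course rank the pooled scores once and use the rank-sum form of the statistic rather than the quadratic pairwise comparison above.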

3 Adaptive approximation

Here we focus on very simple approximants of ROC∗, taken as piecewise constant curves. Precisely, to any subdivision σ : α0 = 0 < α1 < . . . < αK < αK+1 = 1 of the unit interval, we associate the curve given by: ∀α ∈ (0, 1),

Eσ(ROC∗)(α) = Σ_{k=0}^{K} I{α ∈ [αk, αk+1[} · ROC∗(αk).   (3)

We point out that the approximant Eσ(ROC∗)(α) is actually a ROC curve. It coincides indeed with ROC(s∗σ, ·), where s∗σ is the piecewise constant scoring function given by:

∀x ∈ X,  s∗σ(x) = Σ_{k=1}^{K+1} I{x ∈ R∗αk},   (4)

which is obtained by "overlaying" the regression level sets R∗αk = {x ∈ X : η(x) > Q∗(αk)}, 1 ≤ k ≤ K.

Adaptive approximation. For free-knot splines, it is well known that the approximation rate in supremum norm by piecewise constant functions with at most K pieces is of the order O(K⁻¹) if and only if the target function belongs to the space BV([0, 1]) of functions of bounded variation on (0, 1), i.e. the space of absolutely continuous functions f : (0, 1) → R such that f′ ∈ L1([0, 1]);

see Chapter 12 in (Devore & Lorentz, 1993). From a practical perspective, however, in the absence of full knowledge of the target curve, it is a very challenging task to determine a grid of points {αk : 1 ≤ k ≤ K} that yields a nearly optimal approximant. In the case where the points of the mesh grid are fixed in advance, independently of the curve f to approximate, say with uniform spacing, the rate of approximation is of optimal order O(K⁻¹) if and only if f belongs to the space Lip1([0, 1]) of absolutely continuous functions f such that f′ ∈ L∞([0, 1]). The latter condition is precisely the type of assumption we would like to avoid in the present work. We propose to use adaptive approximation schemes instead of fixed grids. In such procedures, the mesh grid is progressively refined by adding new breakpoints as further information about the local variation of the target is gained: this way, one uses a coarse mesh where the target is smooth, and a finer mesh where it exhibits a high degree of variability. Given the properties of the target ROC∗ (a concave nondecreasing curve connecting (0, 0) to (1, 1)), an ideal mesh grid should become finer and finer as one gets close to the origin; see Fig. 1.

Dyadic recursive partitioning. For computational reasons, here we shall restrict ourselves to a dyadic grid of points αj,k = k2^−j, with j ∈ N and k ∈ {0, . . . , 2^j − 1}, and to partitions of the unit interval [0, 1] produced by recursive dyadic partitioning: any dyadic interval Ij,k = [αj,k, αj,k+1) is possibly split into two halves, producing two siblings Ij+1,2k and Ij+1,2k+1, depending on the (estimated) local properties of the target curve. The adaptive estimation algorithm described in the next section will then appear as a top-down search strategy through a tree structure T, on which the Ij,k's are aligned. Precisely, we will consider approximants of the form:

Σ_{Ij,k ∈ {terminal nodes}} ROC∗(αj,k) · I{α ∈ [αj,k, αj,k+1)},

where the sum is taken over all dyadic intervals corresponding to terminal nodes and the weights ω(·) fulfill the following two conditions:

(i) (Keep-or-kill) For any dyadic interval I ⊂ [0, 1), the weight ω(I) belongs to {0, 1}.

(ii) (Heredity) If ω(I) = 1, then for any dyadic interval I′ such that I ⊂ I′, we have ω(I′) = 1. If ω(I) = 0, then for any dyadic subinterval I′ ⊂ I, we have ω(I′) = 0.

Each collection ω of weights satisfying these two constraints is said to be admissible and determines the nodes of a subtree Tω of the tree T representing the set of all dyadic intervals. A dyadic subinterval I is said to be terminal when ω(I) = 1 and ω(I′) = 0 for any dyadic subinterval I′ ⊂ I: terminal subintervals correspond to the outer leaves of Tω and form a partition Pω of [0, 1). The algorithm described in the next section consists of selecting those intervals, i.e. the set ω. We denote by

σω the mesh grid made of the endpoints of the terminal subintervals selected by the collection of weights ω. Given two admissible sequences of weights ω1 and ω2, the mesh σω1 is said to be finer than σω2 when {I : ω2(I) = 0} ⊂ {I : ω1(I) = 0}.
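To make the recursive dyadic refinement concrete, here is a small sketch (ours) of the keep-or-kill rule when the target curve can be evaluated exactly; the local variation curve(b) − curve(a) plays the role of E(I) introduced in Section 5 and drives the splitting. Function and parameter names are our own.

```python
import numpy as np

def adaptive_dyadic_mesh(curve, eps, max_depth=20):
    """Split dyadic intervals [a, b) of [0, 1) as long as the local variation
    curve(b) - curve(a) exceeds eps (keep-or-kill); return the final mesh grid."""
    breakpoints = {0.0, 1.0}
    active = [(0.0, 1.0)]
    for _ in range(max_depth):
        if not active:
            break
        next_active = []
        for a, b in active:
            if curve(b) - curve(a) > eps:      # too much local variation: refine
                mid = 0.5 * (a + b)
                breakpoints.add(mid)
                next_active += [(a, mid), (mid, b)]
            # otherwise the interval becomes a terminal node of the tree
        active = next_active
    return np.array(sorted(breakpoints))

# toy target with a vertical tangent at the origin, as allowed in this paper
mesh = adaptive_dyadic_mesh(np.sqrt, eps=0.05)
print(len(mesh), mesh[:5])   # the mesh is much finer near 0 than near 1
```

With the toy curve √α, the breakpoints concentrate near the origin, which is exactly the behavior illustrated by Fig. 1.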

Figure 1: A piecewise constant approximant of ROC∗ .

4 Empirical MV-set estimation

Beyond the functional approximation facet of the problem, another key ingredient of the estimation procedure consists of estimating specific points (αj,k, ROC∗(αj,k)) = (H(R∗αj,k), G(R∗αj,k)) lying on the optimal ROC curve, in order to gain information about its location in the ROC space and, at the same time, about the way it locally varies. Following in the footsteps of (Clémençon & Vayatis, 2008b), a constructive approach to this problem lies in viewing X \ R∗α as the solution of the following minimum-volume set (MV-set) estimation problem:

min_{W ∈ B(X)} G(W)  subject to  H(W) ≥ 1 − α,

where the minimum is taken over the set B(X) of all measurable subsets W ⊂ X. Equivalently, this boils down to solving the constrained optimization problem:

sup_{R ∈ B(X)} G(R)  subject to  H(R) ≤ α.

From a statistical perspective, the search should be based on the empirical distributions

Ĥ(dx) = (1/n−) Σ_{i=1}^{n} I{Yi = −1} · δ_{Xi}(dx),   Ĝ(dx) = (1/n+) Σ_{i=1}^{n} I{Yi = +1} · δ_{Xi}(dx),

where δx denotes the Dirac mass at x ∈ X. An empirical version of the optimization problem above is then

OP(α, φ) :  sup_{R ∈ R} Ĝ(R)  subject to  Ĥ(R) ≤ α + φ,

where φ is a complexity penalty and R a class of measurable subsets of X. We denote by R̂α a solution of this problem. The success of this program hinges upon the richness of the class R and the calibration of the tolerance parameter φ, as shown by the next result, established in (Clémençon & Vayatis, 2008b) (see also (Scott & Nowak, 2005)) and involving the following technical assumptions.

A3 For all α ∈ (0, 1), we have R∗α ∈ R.

A4 The set R is such that the Rademacher average

An = E( sup_{R ∈ R} (1/n) Σ_{i=1}^{n} εi I{Xi ∈ R} ),

where the εi's are independent Rademacher random variables, is of order O(n^−1/2).
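The following sketch (ours) makes the empirical problem OP(α, φ) concrete in the simplest possible setting, where R is a finite list of candidate sets given as boolean masks over the sample; real instantiations of the class R and of the solver are of course richer, and the function name solve_op is our own.

```python
import numpy as np

def solve_op(candidates, labels, alpha, phi):
    """OP(alpha, phi): among candidate sets R (boolean masks over the sample),
    maximize G_hat(R) subject to H_hat(R) <= alpha + phi."""
    labels = np.asarray(labels)
    n_pos, n_neg = np.sum(labels == +1), np.sum(labels == -1)
    best_mask, best_g = None, -np.inf
    for mask in candidates:
        mask = np.asarray(mask, dtype=bool)
        h_hat = np.sum(mask & (labels == -1)) / n_neg   # empirical mass under H
        g_hat = np.sum(mask & (labels == +1)) / n_pos   # empirical mass under G
        if h_hat <= alpha + phi and g_hat > best_g:
            best_mask, best_g = mask, g_hat
    return best_mask, best_g

# toy usage: candidates are upper level sets {x : x > t} of a one-dimensional feature
rng = np.random.default_rng(1)
y = rng.choice([-1, 1], size=500)
x = y + rng.normal(size=500)
candidates = [x > t for t in np.quantile(x, np.linspace(0.05, 0.95, 19))]
mask, g_hat = solve_op(candidates, y, alpha=0.1, phi=0.02)
print(g_hat)
```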

Note that assumption A4 is satisfied, for instance, when R is a VC class (see for instance (Boucheron et al., 2005) for the use of Rademacher averages in complexity control).

Theorem 4 (Clémençon & Vayatis, 2008b) Suppose that assumptions A1-A4 are fulfilled and, for all δ ∈ (0, 1), set

φ = φ(δ, n) = 2An + √(2 log(1/δ)/n).

Then, there exists a constant c < ∞ such that, for all δ ∈ (0, 1), we have with probability at least 1 − δ: ∀n ∈ N∗, ∀α ∈ (0, 1),

H(R̂α) ≤ α + 2φ(δ/2, n)   and   G(R̂α) ≥ ROC∗(α) − 2φ(δ/2, n).

Remark 2 (Regularity vs. noise condition) Under the additional condition that the distribution of η(X), denoted by F∗ = pG∗ + (1 − p)H∗, has a bounded density f∗, the following extension of Tsybakov's noise condition ((Tsybakov, 2004)) is fulfilled for any α ∈ (0, 1):

∀t ≥ 0,  P{ |η(X) − Q∗(α)| ≤ t } ≤ c · t^{a/(1−a)},

with a = 1/2 and c = sup_t f∗(t). Notice that this condition is incompatible with assumption A2 when a > 1/2. It has been shown in (Clémençon & Vayatis, 2008b) (see Theorem 12 therein) that, under this assumption, the deviation ROC∗(α) − G(R̂α) is then of order O(n^−5/8).

Adaptive estimation - Algorithm 1

(Input.) Target tolerance ε ∈ (0, 1). Volume tolerance φ > 0. Training data Dn = {(Xi, Yi) : 1 ≤ i ≤ n}. Class R of level set candidates.

1. (Initialization.) Set ωε(I0,0) = 0 and ωε(Ij,k) = 1 for every dyadic subinterval I ⊊ I0,0 = [0, 1). Take β̂0,0 = 0 and β̂0,1 = 1.

2. (Iterations.) For all j ≥ 0, for all k ∈ {0, . . . , 2^j − 1}: if ωε(Ij,k) = 0, then

(a) Compute Ê(Ij,k) = β̂j,k+1 − β̂j,k.

(b) If Ê(Ij,k) > ε, then

i. set ωε(Ij+1,2k) = ωε(Ij+1,2k+1) = 0,

ii. solve the problem OP(αj+1,2k+1, φ) → solution R̂αj+1,2k+1,

iii. update: β̂j+1,2k = β̂j,k, β̂j+1,2k+1 = Ĝ(R̂αj+1,2k+1), β̂j+1,2k+2 = β̂j,k+1.

(c) Else, leave the weights of the siblings of Ij,k unchanged.

3. (Stopping rule.) The algorithm terminates as soon as the weights ω(·) of the nodes of the current level j are all equal to 1.

(Output.) Let σ̂ be the collection of dyadic levels αj,k corresponding to the terminal nodes defined by ωε. Compute the ROC∗ estimate:

ROĈ∗(α) = Σ_{αj,k ∈ σ̂} Ĝ(R̂αj,k) · I{α ∈ Ij,k}.   (5)
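Read together, the two sketches above make Algorithm 1 easy to render in a few lines. The sketch below (ours) assumes a routine ghat_of(alpha) that solves OP(α, φ) for the given data and returns the value Ĝ(R̂α), for instance a wrapper around the solve_op sketch of Section 4; the names are ours.

```python
import numpy as np

def algorithm1(ghat_of, eps, max_depth=12):
    """Adaptive estimation of ROC*: dyadic refinement driven by the empirical
    local error E_hat([a, b)) = beta_hat(b) - beta_hat(a), where
    beta_hat(alpha) = G_hat(R_hat_alpha) is returned by the MV-set solver."""
    beta = {0.0: 0.0, 1.0: 1.0}        # estimated ROC* values at the breakpoints
    active = [(0.0, 1.0)]
    for _ in range(max_depth):         # the paper stops before depth log2(n)
        if not active:
            break
        next_active = []
        for a, b in active:
            if beta[b] - beta[a] > eps:           # empirical local error above eps: split
                mid = 0.5 * (a + b)
                beta[mid] = ghat_of(mid)          # solve OP(mid, phi) -> G_hat(R_hat_mid)
                next_active += [(a, mid), (mid, b)]
        active = next_active
    knots = sorted(beta)
    # knots[i], values[i] define the piecewise constant estimate on [knots[i], knots[i+1])
    return np.array(knots), np.array([beta[k] for k in knots])
```

Theorem 5 below suggests the calibration ε = 7φ(δ/2, n) for the tolerance.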

5 Adaptive estimation of the optimal ROC curve

In this section we describe an adaptive algorithm for estimating the optimal curve ROC∗ by piecewise constants. It should be interpreted as a statistical version of the adaptive approximation scheme studied in (Devore, 1987). We emphasize that the crucial difference with the approach developed in (Clémençon & Vayatis, 2008b) is that, here, the mesh grid used for computing the ROC∗ estimate, i.e. the cardinality of the grid of points as well as their locations, is entirely learnt from the data. In this respect, we define the empirical local error estimate on the subinterval I = [α1, α2) ⊂ [0, 1) as

Ê(I) = Ĝ(R̂α2) − Ĝ(R̂α1).

The quantity Ê(I) is nonnegative (by construction, the mapping α ∈ (0, 1) ↦ Ĝ(R̂α) is nondecreasing with probability one) and should be viewed as an empirical counterpart of E(I) := ROC∗(α2) − ROC∗(α1), which provides a simple way of estimating the variability of the (nondecreasing) function ROC∗ on I. This measure is additive, as is its statistical version Ê(·):

E(I1 ∪ I2) = E(I1) + E(I2)

for any two siblings I1 and I2 of the same subinterval. In addition, it controls the approximation rate of ROC∗ by a constant on any interval I ⊂ [0, 1), in the sense that:

inf_{c ∈ [0,1)} ||ROC∗(·) − c||_{L∞(I)} ≤ E(I).

The adaptive algorithm designated as 'Algorithm 1' is based on the following principle: a dyadic subinterval I is part of the final partition of the false positive rate axis whenever the empirical local error meets the tolerance ε on I but does not meet it on any of its ancestors J ⊃ I. We point out that, by construction, the sequence ωε produced by Algorithm 1 is admissible.

Remark 3 (On the stopping rule) One should notice that, as Ĥ(R) ∈ {k/n : k = 0, . . . , n} for any R ∈ R, the estimation algorithm necessarily stops before exceeding the level j = j(n) = ⌊log(n)/log(2)⌋: the empirical estimate ROĈ∗ has no more than 2^{j(n)} pieces.

We now establish a rate of convergence for Algorithm 1. The following assumption shall be required. It classically allows one to control the rate at which the derivative of ROC∗(α) may go to infinity as α tends to zero; see (Bennett & Sharpley, 1988).

A5 The derivative ROC∗′ belongs to the space L log L of Borel functions f : (0, 1) → R such that:

||f||_{L log L} := ∫_{α=0}^{1} (1 + log|f(α)|) |f(α)| dα < ∞.
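As a quick sanity check (ours, not in the paper), the toy curve ROC∗(α) = √α used earlier, which has a vertical tangent at the origin and therefore violates the boundedness assumption of (Clémençon & Vayatis, 2008b), does satisfy A5:

```latex
\mathrm{ROC}^{*\prime}(\alpha) = \tfrac{1}{2}\,\alpha^{-1/2}, \qquad
\int_0^1 \Bigl(1 + \log\bigl|\tfrac{1}{2}\alpha^{-1/2}\bigr|\Bigr)\,\tfrac{1}{2}\alpha^{-1/2}\,d\alpha
= (1-\log 2)\int_0^1 \tfrac{\alpha^{-1/2}}{2}\,d\alpha
  \;-\; \tfrac{1}{4}\int_0^1 \alpha^{-1/2}\log\alpha\,d\alpha
= (1-\log 2) + 1 \;<\; \infty .
```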

The next result provides a bound on the rate of the estimator produced by Algorithm 1.

Theorem 5 Let δ ∈ (0, 1). Suppose that assumptions A1-A5 are fulfilled. Take ε = ε(δ, n) := 7φ(δ/2, n). Then, we have, with probability at least 1 − δ: ∀n ≥ 1,

||ROC∗ − ROĈ∗||∞ ≤ 16φ(δ/2, n).

Moreover, the number of terminal nodes in the output of Algorithm 1 satisfies the following upper bound:

#σ̂ ≤ κ ||ROC∗′||_{L log L} / φ(δ/2, n),   (6)

for some constant κ.

Corollary 6 Let δ ∈ (0, 1). Suppose that assumptions A1-A5 are fulfilled. Take ε and φ of the order of √(n⁻¹ log(1/δ)). Then, there exists a constant c such that we have, with probability at least 1 − δ: ∀n ≥ 1,

||ROC∗ − ROĈ∗||∞ ≤ c/√n + √(2 log(1/δ)/n),

and the adaptive Algorithm 1 builds a partition σ̂ whose cardinality is at most of the order of √n.

Remark 4 (On the rate of convergence.) When assuming ROC∗ of class C1 on [0, 1] (which implies in particular that ROC∗′ is bounded in the neighborhood of 0), it may be shown that a piecewise constant estimate with rate O(n^−1/2) can be built using K = O(n^1/2) equispaced grid points, cf. (Clémençon & Vayatis, 2008b). It is remarkable that the adaptive scheme of Algorithm 1 achieves comparable performance while significantly relaxing the smoothness assumption on ROC∗.

Remark 5 (On lower bounds.) To our knowledge, no lower bound result related to the statistical estimation of ROC∗ in sup norm is currently available in the literature. Intuition indicates that the rate O(n^−1/2) is accurate insofar as, in the absence of further assumptions, it is likely the best rate that can be obtained for the MV-set estimation problem, and consequently for the local estimation of ROC∗ at a given point α ∈ (0, 1).

6 Adaptive ranking algorithm

We now tackle the problem of building a scoring function ŝ(x) whose ROC curve is asymptotically close to the empirical estimate ROĈ∗. In general, the latter is not a ROC curve: by construction, the sequence of sets R̂αj,k, (j, k) ∈ σ̂, sorted by increasing order of magnitude of their level αj,k, is not necessarily increasing, as would have been the case with the true level sets R∗αj,k. This induces an additional 'Monotonicity' step in Algorithm 2, before overlaying the estimated sets.

Adaptive RankOver - Algorithm 2

(Input.) Target tolerance ε ∈ (0, 1). Volume tolerance φ > 0. Training data Dn = {(Xi, Yi) : 1 ≤ i ≤ n}. Class R of level set candidates.

1. (Algorithm 1.) Run Algorithm 1 in order to get the regression level set estimates R̂α(1), . . . , R̂α(K̂ε), where K̂ε = #σ̂ and 0 = α(1) < . . . < α(K̂ε) < 1.

2. (Monotonicity.) Form recursively the nondecreasing sequence R̃α(k) defined by: R̃α(1) = R̂α(1) and R̃α(k+1) = R̃α(k) ∪ R̂α(k+1) for 1 ≤ k < K̂ε.

(Output.) Build the piecewise constant scoring function:

ŝε(x) = Σ_{k=1}^{K̂ε} I{x ∈ R̃α(k)}.

Remark 6 (Top-down vs. Bottom-up) Alternatively, a monotone sequence of sets can be built from the collection {R̂α(k), 1 ≤ k ≤ K̂ε} in the following way: set R̄α(K̂ε) = R̂α(K̂ε) and R̄α(k) = R̄α(k+1) ∩ R̂α(k) for k = K̂ε − 1, . . . , 1. A result similar to the one stated below can be established for s̄ε(x) = Σ_{k=1}^{K̂ε} I{x ∈ R̄α(k)}.
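As an illustration of the 'Monotonicity' and overlay steps of Algorithm 2, here is a small sketch (ours) in which the estimated level sets are represented as boolean masks over a fixed collection of points, listed by increasing level α(k); the function name is our own.

```python
import numpy as np

def overlay_scoring_rule(estimated_sets):
    """Make the estimated level sets nested by cumulative unions, then overlay
    their indicators: s_hat(x) = sum_k I{x in R_tilde_{alpha(k)}}."""
    masks = [np.asarray(m, dtype=bool) for m in estimated_sets]
    nested = [masks[0]]
    for m in masks[1:]:
        nested.append(nested[-1] | m)    # R_tilde_{k+1} = R_tilde_k union R_hat_{k+1}
    return np.sum(np.stack(nested), axis=0)

# toy usage: three (possibly non-nested) estimated sets over six points
sets = [np.array([0, 0, 1, 0, 0, 0], dtype=bool),
        np.array([0, 1, 1, 0, 1, 0], dtype=bool),
        np.array([1, 1, 0, 1, 1, 0], dtype=bool)]
print(overlay_scoring_rule(sets))   # larger value = ranked closer to the top
```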

The next theorem states the consistency of the estimated scoring function under the same complexity and regularity assumptions.

Theorem 7 Let δ ∈ (0, 1). Suppose that assumptions A1-A5 are fulfilled. Take a target tolerance ε of the order of n^−1/6. Then, there exists a constant c = c(δ) > 0 such that we have, with probability at least 1 − δ: ∀n ≥ 1,

||ROC(ŝε, ·) − ROC∗||∞ ≤ c √(log n / n^{1/3}).

We observe that the rate of convergence of the order of n^−1/6 obtained in Theorem 7 is much slower than the n^−1/3 rate obtained in (Clémençon & Vayatis, 2008b). This is due to the following facts:

• by relaxing the regularity assumptions on the optimal ROC curve, the space of target functions becomes much larger;

• we use a relatively small search space of candidate scoring functions, namely piecewise constant scoring functions, whereas piecewise linear scoring functions were used before.

We expect that, using nonlinear approximation techniques, the n^−1/6 rate can be significantly improved, but we leave this issue open for future work.

7 Conclusion

In this paper, we have seen how strong consistency of a piecewise constant estimate of the optimal ROC curve can be guaranteed under weak regularity assumptions on the underlying distribution of the data. Moreover, our approach leads to a strongly consistent piecewise constant scoring rule in terms of ROC curve performance. In particular, we proposed two algorithms which adapt to the importance of the local variations of the ROC curve in order to build these estimates. In the case of ROC curve optimization with a supremum norm criterion, one needs to perform both approximation and estimation at the same time. This can be done by partitioning the input space, for instance using recursive partitioning techniques (we refer to our previous work (Clémençon & Vayatis, 2008d) for ranking tree-based methods, and (Clémençon & Vayatis, 2009) for ranking histogram rules). The issue there is how to split and select cells considering the goal of bipartite ranking. The approach taken in the present paper is different, as it consists of selecting a partition of the false positive rate axis in the ROC space to build a finite-dimensional approximation of the optimal ROC curve. In this approach, the bipartite ranking problem reduces to a collection of classification problems with an additional constraint, solved by an empirical MV-set procedure.

Appendix - Proofs

Proof of Proposition 3

The proof is taken from (Clémençon & Vayatis, 2008b) and we provide it here for completeness. First, we observe that, for any measurable function h, we have, by a change of probability measure, that:

E( h(X) | Y = +1 ) = E( ((1 − p)/p) · (η(X)/(1 − η(X))) · h(X) | Y = −1 ).

We apply this to h(X) = I{X ∈ R∗α} − I{X ∈ Rs,α} in order to get:

ROC∗(α) − ROC(s, α) = E( ((1 − p)/p) · (η(X)/(1 − η(X))) · h(X) | Y = −1 ).

Then we add and subtract Q∗(α)/(1 − Q∗(α)) and, using the fact that

α = P{X ∈ Rs,α | Y = −1} = P{X ∈ R∗α | Y = −1},

we get:

ROC∗(α) − ROC(s, α) = ((1 − p)/p) · E( ( η(X)/(1 − η(X)) − Q∗(α)/(1 − Q∗(α)) ) · h(X) | Y = −1 ).

We remove the conditioning with respect to Y = −1 and, conditioning then on X, we obtain:

ROC∗(α) − ROC(s, α) = (1/p) · E( ((η(X) − Q∗(α))/(1 − Q∗(α))) · h(X) ).

Proof of Theorem 4

This proof is also taken from (Clémençon & Vayatis, 2008b). In order to prove the desired result, we introduce further notation, namely the constrained class

Rα,φ = { R ∈ R : Ĥn(R) ≤ α + φ(δ/2, n) },

so that one may write

R̂α = arg max_{R ∈ Rα,φ} Ĝn(R).

We shall consider the following events:

ΘH = { H(R̂α) > α + 2φ(δ/2, n) }   and   ΘG = { G(R̂α) < G(R∗α) − 2φ(δ/2, n) },

as well as

ΩH = { sup_{R ∈ R} |Ĥn(R) − H(R)| > φ(δ/2, n) }   and   ΩG = { sup_{R ∈ R} |Ĝn(R) − G(R)| > φ(δ/2, n) }.

The complementary event of any event E will be denoted by E^c. The matter is to establish a lower bound for the probability of occurrence of the complementary event of ΘH ∪ ΘG. We shall prove that

ΘH ∪ ΘG ⊂ ΩH ∪ ΩG,   (7)

and the result will then follow from the union bound combined with McDiarmid's concentration inequality and the control of the empirical process by a Rademacher average through a double symmetrization argument (see (Boucheron et al., 2005) for details). Indeed, for all δ ∈ (0, 1), the event ΩH (respectively, the event ΩG) occurs with probability less than δ/2.

Observe first that ΩH^c ∩ ΩG^c ⊂ ΘG^c. As a matter of fact, on the event ΩH^c we have

Ĥn(R∗α) − α ≤ sup_{R ∈ R} |Ĥn(R) − H(R)| ≤ φ(δ/2, n),

so that we have R∗α ∈ Rα,φ and thus Ĝn(R̂α) ≥ Ĝn(R∗α). In addition, since

G(R̂α) = (G(R̂α) − Ĝn(R̂α)) + (Ĝn(R̂α) − Ĝn(R∗α)) + (Ĝn(R∗α) − G(R∗α)) + G(R∗α),

on the event ΩH^c ∩ ΩG^c we have G(R̂α) ≥ G(R∗α) − 2φ(δ/2, n), and the latter event corresponds to ΘG^c. Eventually, on the event ΩH^c, we have

H(R̂α) ≤ Ĥn(R̂α) + sup_{R ∈ R} |H(R) − Ĥn(R)| ≤ α + 2φ(δ/2, n),

so that ΩH^c ⊂ ΘH^c.

Proof of Theorem 5

We first prove a lemma which quantifies the uniform deviation of the empirical local error from the true error over all dyadic scales.

Lemma 8 (Uniform deviation) Suppose that assumptions A3-A4 are satisfied. Let δ ∈ (0, 1). With probability at least 1 − δ, we have: ∀n ≥ 1,

sup_{j ≥ 0, 0 ≤ k < 2^j} |Ê(Ij,k) − E(Ij,k)| ≤ 6φ(δ/2, n).

The proof immediately follows from the complexity assumption A4, combined with Theorem 4.

It follows from Lemma 8 that, with probability at least 1 − δ: ∀j ≤ log2 n, ∀k ∈ {0, . . . , 2^j − 1},

Ê(Ij,k) ≤ E(Ij,k) + 6φ(δ/2, n)   and   Ê(Ij,k) ≥ E(Ij,k) − 6φ(δ/2, n).

We now introduce the notation σε for partitions based on the optimal ROC curve at a target tolerance of ε. Let ε > 0 and consider the piecewise constant approximant built from the same recursive strategy as the one implemented by Algorithm 1, except that it is based on the (theoretical) error estimate E(·): Eσε(ROC∗), σε denoting the associated mesh grid. Choosing ε = 7φ(δ/2, n), we obtain that, with probability larger than 1 − δ, the mesh grid σ̂ is finer than σẽ1, where

ẽ1 = ẽ1(δ, n) = ε + 6φ(δ/2, n) = 13φ(δ/2, n),

but coarser than σẽ0, with

ẽ0 = ẽ0(δ, n) = ε − 6φ(δ/2, n) = φ(δ/2, n).

The quantity ||Eσ̂(ROC∗) − ROC∗||∞ is thus bounded by:

||Eσẽ1(ROC∗) − ROC∗||∞ ≤ ẽ1.

Now we use the following decomposition:

||ROĈ∗ − ROC∗||∞ ≤ ||ROC∗ − Eσ̂(ROC∗)||∞ + ||Eσ̂(ROC∗) − ROĈ∗||∞.

We have seen that the first term is bounded, with probability at least 1 − δ, by ẽ1. On the same event, we have:

||Eσ̂(ROC∗) − ROĈ∗||∞ ≤ max_{1 ≤ k ≤ K̂ε} |G(R∗α(k)) − Ĝ(R̂α(k))| ≤ max_{1 ≤ k ≤ K̂ε} |G(R∗α(k)) − G(R̂α(k))| + max_{1 ≤ k ≤ K̂ε} |G(R̂α(k)) − Ĝ(R̂α(k))| ≤ 3φ(δ/2, n).