Journal of Machine Learning Research 1 (2003) ?
Submitted December 30, 2002; Published ?
Efficient Margin Maximizing with Boosting∗

Gunnar Rätsch
GUNNAR.RAETSCH@ANU.EDU.AU
The Australian National University, Canberra, ACT 0200, Australia

Manfred K. Warmuth
MANFRED@CSE.UCSC.EDU
University of California at Santa Cruz, Santa Cruz, CA 95060, USA
Editor: Leslie Pack Kaelbling
Abstract

AdaBoost produces a linear combination of base hypotheses and predicts with the sign of this linear combination. It has been observed that the generalization error of the algorithm continues to improve even after all examples are classified correctly by the current signed linear combination, which can be viewed as a hyperplane in feature space where the base hypotheses form the features. The improvement is attributed to the experimental observation that the distances (margins) of the examples to the separating hyperplane are increasing even when the training error is already zero, that is, when all examples are on the correct side of the hyperplane. We give a new version of AdaBoost, called AdaBoost∗, that explicitly maximizes the minimum margin of the examples up to a given precision. The algorithm incorporates a current estimate of the achievable margin into its calculation of the linear coefficients of the base hypotheses. The number of base hypotheses needed is essentially the same as the number needed by a previous AdaBoost-related algorithm that required an explicit estimate of the achievable margin.

∗. Parts of this work were done while G. Rätsch was at Fraunhofer FIRST Berlin and at UC Santa Cruz. G. Rätsch was partially funded by DFG under contract JA 379/91, JA 379/71, MU 987/1-1 and by the EU in the NeuroColt II project. M.K. Warmuth and visits of G. Rätsch to UC Santa Cruz were partially funded by the NSF grant CCR-9821087. G. Rätsch thanks S. Mika, S. Lemm and K.-R. Müller for discussions.
1. Introduction

In the most common version of boosting the algorithm is given a fixed set of labeled training examples. In each stage the algorithm produces a probability weighting of the examples. It is then given a base hypothesis whose weighted error (probability of wrong classification) is slightly below 50%. This base hypothesis is used to update the distribution on the examples. Intuitively, the hard examples receive high weights. At the end of each stage the base hypothesis is added to the linear combination, and the sign of this linear combination forms the current hypothesis of the boosting algorithm. The most well known boosting algorithm is AdaBoost (Freund and Schapire, 1997). It is "adaptive" in that the linear coefficient of the base hypothesis depends on the weighted error of the base hypothesis at the time when the base hypothesis was added to the linear combination. Earlier work on boosting includes Schapire (1992) and Freund (1995).

AdaBoost has two interesting properties. First, along with earlier boosting algorithms (Schapire, 1992), it has the property that its training error converges exponentially fast to zero.
More precisely, if the weighted training error of the t-th base hypothesis is ε_t = 1/2 − (1/2)γ_t, then an upper bound on the training error of the signed linear combination is reduced by a factor of 1 − (1/2)γ_t² at stage t. Second, it has been observed experimentally that AdaBoost continues to "learn" even after the training error of the signed linear combination is zero (Schapire et al., 1998); that is, in experiments the generalization error continues to improve. When the training error is zero, then all examples are on the "right side" of the signed linear combination (viewed as a hyperplane in a feature space with threshold zero, where each base hypothesis is one feature/dimension). The margin of an example is the signed distance to the hyperplane times its ±1 label. As soon as the training error is zero, the examples are on the right side and all have positive margin. It has also been observed that the margins of the examples continue to increase even after the training error is zero. There are theoretical bounds on the generalization error of linear classifiers (e.g. Schapire et al., 1998, Breiman, 1999, Koltchinskii et al., 2001) that improve with the margin of the classifier, which is defined as the minimum margin over the examples. So the fact that the margins improve experimentally seems to explain why AdaBoost still learns after the training error is zero.

There is one shortfall in this argument. AdaBoost has not been proven to maximize the margin of the final hypothesis. In fact, in our experiments in Section 5 we observe that AdaBoost does not seem to maximize the margin. Breiman (1999) proposed a modified algorithm, called Arc-GV (Arcing-Game Value), suitable for this task and showed that it asymptotically maximizes the margin (similar results are shown in Grove and Schuurmans (1998) and Bennett et al. (2000)). In this paper we give an algorithm that produces a final hypothesis whose margin lies in [ρ∗ − ν, ρ∗], where ρ∗ is the maximum margin achievable by any linear combination of base hypotheses and ν is a precision parameter.

There are two main problems regarding finding a linear combination with maximum margin. In the first, we know a close lower bound ϱ on the value of the maximum achievable margin ρ∗, i.e. we are given ϱ = ρ∗ − ν, where ν ≥ 0. This problem is solved by a known adaptation of AdaBoost, called AdaBoost_ϱ (cf. Rätsch et al. (2001), Rätsch and Warmuth (2002)), which is guaranteed to find a linear combination with at most 2 log(N)/ν² hypotheses. In this paper we give a slightly improved bound on the size of the linear combination produced by AdaBoost_ϱ. The more challenging second problem is the case when nothing is known about ρ∗. In a related conference paper (Rätsch and Warmuth, 2002) we used AdaBoost_ϱ iteratively in a binary search like fashion. We required log₂(2/ν) calls to AdaBoost_ϱ to obtain a value of the margin ρ in the range [ρ∗ − ν, ρ∗]. All but the last call to AdaBoost_ϱ were essentially aborted, and the last call provided a linear combination with at most 2 log(N)/ν² hypotheses.

In this paper we greatly simplify our answer to the case when ρ∗ is unknown. We have a new algorithm that in one pass produces a linear combination with a margin in the range [ρ∗ − ν, ρ∗], and the size of this linear combination is at most 2 log(N)/ν². Note that the latter bound on the size of the produced linear combination is the same as the bound of the best algorithm (i.e. AdaBoost_ϱ) which at the start is given an estimate ϱ = ρ∗ − ν of the achievable margin ρ∗. Our new algorithm uses a current estimate of the achievable margin in the computation of the linear coefficients of the base learners, and it needs to know the precision parameter ν for recomputing the estimate in each iteration. Except for the algorithm presented in the previous conference paper, this is the first result on the fast convergence of a boosting algorithm to the maximum margin solution that works for all ρ∗ ∈ [−1, 1].
Using previous results one can only show that AdaBoost asymptotically converges to a final hypothesis with margin at least ρ∗/2 if ρ∗ > 0 (cf. Corollary 5) and if other subtle conditions on the chosen base hypotheses are satisfied. AdaBoost was designed to find a final hypothesis of margin at least zero. Our algorithm maximizes the margin for all values of ρ∗. This includes the inseparable case (ρ∗ < 0), where one minimizes the overlap between the two classes. Our algorithm is also useful when the hypotheses given to the boosting algorithm are strong in the sense that they already separate the data and have margin greater than zero, but less than one. In this case 0 < ρ∗ < 1 and AdaBoost aborts immediately because the linear coefficients for such hypotheses are unbounded. Our new algorithm also maximizes the margin by combining strong learners.

The paper is structured as follows: In Section 3 we first extend the original AdaBoost algorithm, leading to AdaBoost_ϱ, which requires a guess ϱ of the maximal achievable margin ρ∗. Then we propose AdaBoost∗, which is similar to AdaBoost_ϱ, but adapts ϱ based on a precision parameter ν. In Section 4 we give a more detailed analysis of both algorithms. First, we prove that if the weighted training error of the t-th base hypothesis is ε_t = 1/2 − (1/2)γ_t, then an upper bound on the fraction of examples with margin smaller than ϱ is reduced by a factor of 1 − (1/2)(ϱ − γ_t)² at stage t of AdaBoost_ϱ (cf. Section 4.2). (A slightly improved factor is shown for the case when ϱ > 0.) To achieve a large margin, however, one needs to assume that the guess ϱ is smaller than ρ∗. For the latter case we prove an exponential convergence rate of AdaBoost_ϱ. Then we discuss a way of automatically tuning ϱ depending on the error of the base hypotheses and a precision parameter ν. We show that AdaBoost∗ is able to achieve the same theoretical performance, even when not knowing the size of the maximal achievable margin. This extends our previous result (Rätsch and Warmuth, 2002), where one had an additional log(2/ν) factor in the number of times the weak learner is called and much worse constants. We complete the paper with experiments confirming our theoretical analysis (Section 5) and a conclusion.
2. Preliminaries and Basic Notation

We consider the standard two-class supervised machine learning setup: Given a set of N i.i.d. training examples (x_n, y_n), n = 1, ..., N, with x_n ∈ X and y_n ∈ Y := {−1, +1}, we would like to learn a function f : X → Y that is able to generalize well on unseen data generated from the same distribution as the training data. In the case of ensemble learning (like boosting), there is a fixed underlying set of base hypotheses H := {h | h : X → [−1, 1]} from which the ensemble is built. For now we only assume that H is finite, but we will show in Section 4.5 that this assumption can be dropped in most cases and that all of the following analysis also applies to the case of infinite hypothesis sets.

Boosting algorithms iteratively form linear combinations of hypotheses from H. In each iteration t, a base hypothesis h_t ∈ H is added to the linear combination. The algorithm that selects the base hypothesis in each iteration is called the weak learner. Most ensemble algorithms generate a combined hypothesis of the following form:

\[
\tilde f_\alpha(x) = \operatorname{sign} f_\alpha(x), \quad \text{where} \quad f_\alpha(x) = \sum_t \frac{\alpha_t}{\sum_r \alpha_r}\, h_t(x).
\]

For AdaBoost, it has been shown that if the weak learner selects base hypotheses of weighted error bounded slightly away from 50%, then the combined hypothesis is consistent with the training set in a small number of iterations (Freund and Schapire, 1997).
We will discuss bounds on the number of needed base hypotheses in detail in Section 4. It suffices to say at this point that the size of the combined hypothesis enters into the PAC analysis of the generalization error (Schapire, 1992, Freund, 1995). In more recent research (Schapire et al., 1998) it was also shown that a bound on the generalization error decreases with the size of the so-called margin of the final hypothesis f. The margin of a single example (x_n, y_n) w.r.t. f is defined as y_n f_α(x_n). Thus the margin quantifies by how far this example is on the y_n side of the hyperplane f̃. The margin of the combined hypothesis f is the minimum margin over all N examples. Thus the goal of this paper is to find a small linear combination of base hypotheses from H with margin close to the maximum achievable margin.
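To fix this notation concretely, the following small sketch (our own illustration, not part of the paper) computes the margins of the examples and the margin of a combined hypothesis when the base-hypothesis outputs are collected in a matrix; the function names and the matrix representation are assumptions made here for illustration only.

```python
import numpy as np

def margins(H, alpha, y):
    """Margins y_n * f_alpha(x_n) of all N examples.

    H     : (N, T) array with H[n, t] = h_t(x_n) in [-1, 1]
    alpha : (T,) array of non-negative hypothesis coefficients
    y     : (N,) array of labels in {-1, +1}
    """
    f = H @ (alpha / alpha.sum())   # normalized combination f_alpha(x_n)
    return y * f

def margin(H, alpha, y):
    """Margin of the combined hypothesis: the minimum margin over all examples."""
    return margins(H, alpha, y).min()
```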
3. Marginal Boosting and AdaBoost∗

The original AdaBoost was designed to find a consistent hypothesis f̃ (i.e. a linear combination f with margin greater than zero). We start with a slight modification of AdaBoost, which finds (if possible) a linear combination of base learners with margin ϱ, where ϱ is pre-specified (cf. Algorithm 1).¹ We call this algorithm AdaBoost_ϱ, as it naturally generalizes AdaBoost for the case when the target margin is ϱ. The original AdaBoost algorithm now becomes AdaBoost_0.

1. The original AdaBoost algorithm was formulated in terms of the weighted training error ε_t of a base hypothesis. Here we use an equivalent, more convenient formulation in terms of the edge γ_t, where ε_t = 1/2 − (1/2)γ_t (cf. Section 4.1).

Algorithm 1 – The AdaBoost_ϱ algorithm – with margin parameter ϱ

1. Input: S = ⟨(x_1, y_1), ..., (x_N, y_N)⟩, number of iterations T, margin target ϱ
2. Initialize: d_n^1 = 1/N for all n = 1, ..., N
3. Do for t = 1, ..., T:
   (a) Train classifier on {S, d^t} and obtain hypothesis h_t : x ↦ [−1, 1]
   (b) Calculate the edge γ_t of h_t:  γ_t = Σ_{n=1}^N d_n^t y_n h_t(x_n)
   (c) If |γ_t| = 1, then α_r = 0 for r = 1, ..., t − 1; α_t = sign(γ_t); break
   (d) Set α_t = (1/2) log((1 + γ_t)/(1 − γ_t)) − (1/2) log((1 + ϱ)/(1 − ϱ))
   (e) Update weights: d_n^{t+1} = d_n^t exp{−α_t y_n h_t(x_n)} / Z_t, where Z_t is chosen such that Σ_{n=1}^N d_n^{t+1} = 1
4. Output: f(x) = Σ_{t=1}^T (α_t / Σ_r α_r) h_t(x)
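The following is a minimal Python sketch of Algorithm 1, assuming the base hypotheses are available as a precomputed output matrix and simulating the weak learner by simply picking the hypothesis with the largest edge; the function name adaboost_rho and this matrix representation are our own choices and not part of the paper.

```python
import numpy as np

def adaboost_rho(H, y, rho, T):
    """Sketch of Algorithm 1 (AdaBoost_rho) on a finite hypothesis set.

    H   : (N, K) array with H[n, k] = h_k(x_n) in [-1, 1]
    y   : (N,) labels in {-1, +1}
    rho : margin target (the guess of the achievable margin), rho in (-1, 1)
    T   : number of iterations
    The weak learner is simulated by picking the hypothesis with the largest edge.
    """
    N, K = H.shape
    d = np.full(N, 1.0 / N)                     # step 2: uniform weighting
    alpha = np.zeros(K)
    for t in range(T):
        edges = (d * y) @ H                     # edges of all base hypotheses
        k = int(np.argmax(edges))               # step 3a: best-selection weak learner
        gamma = edges[k]                        # step 3b: edge gamma_t
        if abs(gamma) >= 1.0:                   # step 3c: one hypothesis suffices
            alpha[:] = 0.0
            alpha[k] = np.sign(gamma)
            break
        a = 0.5 * np.log((1 + gamma) / (1 - gamma)) \
            - 0.5 * np.log((1 + rho) / (1 - rho))   # step 3d
        alpha[k] += a
        d *= np.exp(-a * y * H[:, k])           # step 3e: reweight ...
        d /= d.sum()                            # ... and renormalize (Z_t)
    return alpha / alpha.sum()                  # step 4: normalized coefficients
```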
The algorithm AdaBoost_ϱ is already known as unnormalized Arcing (Breiman, 1999) or as an AdaBoost-type algorithm (Rätsch et al., 2001). Moreover, it is related to algorithms proposed in Freund and Schapire (1999) and Zhang (2002).
The only difference to AdaBoost is the choice of the hypothesis coefficients α_t: an additional term −(1/2) log((1 + ϱ)/(1 − ϱ)) appears in the expression for α_t. This term vanishes when ϱ = 0. The constant ϱ might be seen as a guess of the maximum margin ρ∗. If ϱ is chosen properly (slightly below ρ∗), then AdaBoost_ϱ will converge exponentially fast to a combined hypothesis with a near maximum margin (see Section 4.2 for details).

The following example illustrates how AdaBoost_ϱ works. Assume the weak learner returns the constant hypothesis h_t(x) ≡ 1. The error of this hypothesis is the sum of all negative weights, i.e. ε_t = Σ_{n: y_n = −1} d_n^t, and its edge is γ_t = 1 − 2ε_t. The parameter α_t is chosen so that the edge of h_t with respect to the new distribution is exactly ϱ (instead of 0 as for the original AdaBoost). Actually, the given choice of α_t assures that this edge is ϱ only for ±1-valued base hypotheses. For a more general base hypothesis h_t with continuous range [−1, +1], choosing α_t such that Z_t as a function of α_t is minimized guarantees that the edge of h_t with respect to distribution d^{t+1} is ϱ (see Schapire and Singer, 1999, for a similar discussion). Choosing α_t as in step 3d approximately minimizes Z_t when the range of h_t is [−1, +1].

In Kivinen and Warmuth (1999) and Lafferty (1999) the standard boosting algorithms are interpreted as approximate solutions to the following optimization problem: choose a distribution d of maximum entropy subject to the constraints that the edges of the previous hypotheses are equal to zero. In this paper the same reasoning instead uses the inequality constraints that the edges of the previous hypotheses are at most ϱ. The α_t's function as Lagrange multipliers for these inequality constraints. Since g(x) = (1/2) ln((1 + x)/(1 − x)) is an increasing function,

\[
\alpha_t \;=\; \frac{1}{2}\log\frac{1+\gamma_t}{1-\gamma_t} \;-\; \frac{1}{2}\log\frac{1+\varrho}{1-\varrho} \;\ge\; 0
\qquad \text{iff} \qquad \gamma_t \ge \varrho. \tag{1}
\]
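The claim that step 3d drives the edge of h_t under the new distribution to exactly ϱ (for ±1-valued hypotheses) can be checked numerically; the following small script is our own illustration and not from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
N, rho = 200, 0.2
y = rng.choice([-1, 1], size=N)
h = rng.choice([-1, 1], size=N)                 # a ±1-valued base hypothesis
d = rng.random(N); d /= d.sum()                 # an arbitrary distribution

gamma = np.sum(d * y * h)                       # edge of h under d
alpha = 0.5 * np.log((1 + gamma) / (1 - gamma)) \
        - 0.5 * np.log((1 + rho) / (1 - rho))   # step 3d of Algorithm 1
d_new = d * np.exp(-alpha * y * h)
d_new /= d_new.sum()                            # step 3e

print(np.sum(d_new * y * h))                    # prints rho (= 0.2) up to rounding
```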
Note that when ϱ = 0, adding h_t or −h_t leads to the same distribution d^{t+1}. This symmetry is broken for ϱ ≠ 0.

Since one does not know the value of the optimum margin ρ∗ beforehand, one also needs to find ρ∗. In Rätsch and Warmuth (2002) we presented the Marginal AdaBoost algorithm, which constructs a sequence {ϱ_r}_{r=1}^R converging to ρ∗: A fast way to find a real value up to a certain accuracy ν in the interval [−1, 1] is to use a binary search; one needs only log₂(2/ν) search steps.² Thus the previous Marginal AdaBoost algorithm uses AdaBoost_{ϱ_r} (Algorithm 1) to decide whether the current guess ϱ_r is larger or smaller than ρ∗. Depending on the outcome, ϱ_r can be chosen so that the region of uncertainty for ρ∗ is roughly cut in half (a generic sketch of this bisection is given below). However, in the previous algorithm all but the last of the log₂(2/ν) iterations of AdaBoost_ϱ are aborted, and this is inefficient.

In this paper we propose a different algorithm, called AdaBoost∗, where ν > 0 is a precision parameter. The algorithm finds a linear combination whose margin lies in the range [ρ∗ − ν, ρ∗]. As Arc-GV (Breiman, 1999), the new algorithm essentially runs AdaBoost_ϱ once, but instead of having a fixed margin estimate ϱ, it updates ϱ in an appropriate way. We shall show guarantees for our algorithm AdaBoost∗ which are not known for Arc-GV. The latter produces an essentially³ monotonically increasing sequence of margin estimates, while in AdaBoost∗ we use a monotonically decreasing sequence. The improved sequence of estimates is based on two new theoretical insights, which will be developed in the next section.

2. If one knows that ρ∗ ∈ [a, b], one needs only log₂((b − a)/ν) steps.
3. In the original formulation the sequence was not necessarily increasing, but Rätsch (2001) showed that it leads to the same result and easier proofs if one restricts it to be monotonically increasing.
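The binary search used by Marginal AdaBoost can be sketched generically as follows; the oracle is_achievable stands for the decision made by a run of AdaBoost_ϱ (whether the current guess turned out to be achievable) and is a placeholder of ours, as is the toy oracle used at the end.

```python
def bisect_margin(is_achievable, nu, lo=-1.0, hi=1.0):
    """Locate rho* within an interval of length nu using ~log2((hi-lo)/nu) oracle calls.

    is_achievable(rho) should return True iff a combined hypothesis with margin >= rho
    exists, i.e. iff rho <= rho*.  In Marginal AdaBoost this decision is made by
    running AdaBoost_rho with the guess rho.
    """
    calls = 0
    while hi - lo > nu:
        mid = 0.5 * (lo + hi)
        if is_achievable(mid):
            lo = mid            # rho* is at least mid
        else:
            hi = mid            # rho* is below mid
        calls += 1
    return lo, calls            # lo is within nu of rho*

# toy oracle with rho* = 0.37 (purely illustrative)
rho_est, calls = bisect_margin(lambda r: r <= 0.37, nu=0.1)
print(rho_est, calls)           # 5 = ceil(log2(2 / 0.1)) oracle calls
```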
Surprisingly, the size of the linear combination produced by the new AdaBoost∗ algorithm (see Algorithm 2 for pseudo-code) is also as small as possible. After finding a margin estimate ϱ in the range [ρ∗ − ν, ρ∗], the last call to AdaBoost_ϱ executed by the previous algorithm produced a linear combination of size at most 2 log(N)/ν². The size of the linear combination produced by the new one-pass AdaBoost∗ algorithm is also at most 2 log(N)/ν².

Algorithm 2 – The AdaBoost∗ algorithm – with accuracy parameter ν

1. Input: S = ⟨(x_1, y_1), ..., (x_N, y_N)⟩, number of iterations T, desired accuracy ν
2. Initialize: d_n^1 = 1/N for all n = 1, ..., N
3. Do for t = 1, ..., T:
   (a) Train classifier on {S, d^t} and obtain hypothesis h_t : x ↦ [−1, 1]
   (b) Calculate the edge γ_t of h_t:  γ_t = Σ_{n=1}^N d_n^t y_n h_t(x_n)
   (c) If |γ_t| = 1, then α_r = 0 for r = 1, ..., t − 1; α_t = sign(γ_t); break
   (d) Set ϱ_t = ( min_{r=1,...,t} γ_r ) − ν
   (e) Set α_t = (1/2) log((1 + γ_t)/(1 − γ_t)) − (1/2) log((1 + ϱ_t)/(1 − ϱ_t))
   (f) Update weights: d_n^{t+1} = d_n^t exp{−α_t y_n h_t(x_n)} / Z_t, where Z_t is chosen such that Σ_{n=1}^N d_n^{t+1} = 1
4. Output: f(x) = Σ_{t=1}^T (α_t / Σ_r α_r) h_t(x)
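Below is a minimal sketch of Algorithm 2, under the same assumptions as the AdaBoost_ϱ sketch above (precomputed hypothesis matrix, best-selection weak learner); the small clamp on ϱ_t is a numerical guard of ours and is not part of the algorithm.

```python
import numpy as np

def adaboost_star(H, y, nu, T):
    """Sketch of Algorithm 2 (AdaBoost*) on a finite hypothesis set.

    H  : (N, K) array with H[n, k] = h_k(x_n) in [-1, 1]
    y  : (N,) labels in {-1, +1}
    nu : desired accuracy; the final margin should lie in [rho* - nu, rho*]
    """
    N, K = H.shape
    d = np.full(N, 1.0 / N)
    alpha = np.zeros(K)
    min_edge = 1.0                                   # running min_{r<=t} gamma_r
    for t in range(T):
        edges = (d * y) @ H
        k = int(np.argmax(edges))                    # step 3a (best selection)
        gamma = edges[k]                             # step 3b
        if abs(gamma) >= 1.0:                        # step 3c
            alpha[:] = 0.0
            alpha[k] = np.sign(gamma)
            break
        min_edge = min(min_edge, gamma)
        rho_t = max(min_edge - nu, -1.0 + 1e-9)      # step 3d (clamped for safety)
        a = 0.5 * np.log((1 + gamma) / (1 - gamma)) \
            - 0.5 * np.log((1 + rho_t) / (1 - rho_t))   # step 3e
        alpha[k] += a
        d *= np.exp(-a * y * H[:, k])                # step 3f
        d /= d.sum()
    return alpha / alpha.sum()
```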
4. Detailed Analysis

4.1 Weak learning and margins

The standard assumption made on the weak learning algorithm for the PAC analysis of boosting algorithms is that the weak learner returns a hypothesis h from a fixed set H that is slightly better than random guessing.⁴ More formally this means that the error rate is consistently smaller than 1/2. Note that the error rate of 1/2 could easily be reached by a fair coin, assuming both classes have the same prior probabilities.

4. In the PAC setting, the weak learner is allowed to fail with probability δ. For the sake of simplicity we assume the weak learner never fails. Our algorithm can be extended to this case.

The error of a hypothesis is defined as the fraction of examples that are misclassified. In boosting this is extended to weighted example sets and one defines the error as

\[
\epsilon_h(d) = \sum_{n=1}^N d_n\, I\big(y_n \ne \operatorname{sign}(h(x_n))\big),
\]
where h is the hypothesis returned by the weak learner and I is the indicator function with I(true) = 1 and I(false) = 0. The weighting d = [d_1, ..., d_N] of the examples is such that d_n ≥ 0 and Σ_{n=1}^N d_n = 1. A more convenient quantity for measuring the quality of the hypothesis h is the edge γ_h(d) = Σ_{n=1}^N d_n y_n h(x_n), which is an affine transformation of ε_h(d) in the case when h(x) ∈ {−1, +1}: ε_h(d) = 1/2 − (1/2)γ_h(d).

Recall from Section 2 that the margin of a given example (x_n, y_n) is defined as y_n f_α(x_n). Also, H is the set from which the weak learner chooses its base hypotheses. Suppose we combined all possible hypotheses from H (assuming for the moment that H is finite; cf. Section 4.5); then the following well-known theorem establishes the connection between margins and edges (first seen in connection with boosting in Freund and Schapire, 1996, Breiman, 1999):

Theorem 1 (Min-Max-Theorem, von Neumann (1928))

\[
\gamma^* := \min_{d}\, \max_{h \in H}\, \sum_{n=1}^N d_n y_n h(x_n)
\;=\; \max_{\alpha}\, \min_{n=1,\dots,N}\, y_n \sum_{k=1}^{|H|} \alpha_k h_k(x_n) \;=:\; \rho^*, \tag{2}
\]
where d ∈ P^N and α ∈ P^{|H|}. Here P^k denotes the k-dimensional probability simplex.

Thus, the minimal edge γ∗ that can be achieved over all possible weightings d of the training set is equal to the maximal margin ρ∗ of any linear combination of hypotheses from H. Also, for any (non-optimal) weightings d and α we always have

\[
\max_{h \in H} \gamma_h(d) \;\ge\; \gamma^* = \rho^* \;\ge\; \min_{n=1,\dots,N} y_n f_\alpha(x_n).
\]
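The right hand side of (2) is a linear program, and for small hypothesis sets ρ∗ can be computed directly. The sketch below is our own formulation (assuming scipy is available): it maximizes ρ subject to y_n Σ_k α_k h_k(x_n) ≥ ρ for all n with α in the probability simplex; by Theorem 1 the optimal value also equals γ∗.

```python
import numpy as np
from scipy.optimize import linprog

def max_margin_lp(H, y):
    """rho* = max over alpha in the simplex of the minimum margin (rhs of (2)).

    H : (N, K) array with H[n, k] = h_k(x_n); y : (N,) labels in {-1, +1}.
    """
    N, K = H.shape
    A = y[:, None] * H                       # A[n, k] = y_n h_k(x_n)
    # variables x = [alpha_1, ..., alpha_K, rho]; maximize rho <=> minimize -rho
    c = np.r_[np.zeros(K), -1.0]
    A_ub = np.c_[-A, np.ones(N)]             # rho - sum_k alpha_k y_n h_k(x_n) <= 0
    b_ub = np.zeros(N)
    A_eq = np.r_[np.ones(K), 0.0][None, :]   # sum_k alpha_k = 1
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=[1.0],
                  bounds=[(0, None)] * K + [(-1, 1)])
    return res.x[:K], res.x[-1]              # optimal alpha and rho* (= gamma*)

# tiny example: 4 examples, 3 random ±1 hypotheses
rng = np.random.default_rng(1)
y = np.array([1, 1, -1, -1])
H = rng.choice([-1, 1], size=(4, 3))
alpha, rho_star = max_margin_lp(H, y)
print(rho_star)
```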
In particular, if the weak learning algorithm guarantees to return a hypothesis with edge at least γ for any weighting (in particular for the optimal weighting d), then γ∗ ≥ γ and by the above duality there exists a combined hypothesis with margin at least γ. If γ = γ∗, i.e. if the lower bound γ is largest possible, then there exists a combined hypothesis with margin exactly γ = ρ∗ (only using hypotheses that are actually returned by the weak learner in response to certain weightings of the examples). From this discussion we can derive a sufficient condition on the weak learning algorithm to reach the maximal margin (for the case when H is finite): If the weak learner returns hypotheses whose edges are at least γ∗, then there exists a linear combination of these hypotheses that has margin γ∗ = ρ∗. We will prove later that our AdaBoost∗ algorithm efficiently finds a linear combination with margin close to ρ∗ (cf. Theorem 6).

Constraining the edges of the previous hypotheses to equal zero (as done in the totally corrective algorithm of Kivinen and Warmuth (1999)) leads to a problem if there is no solution satisfying these constraints. There one considers the edge of a restricted problem using only the hypotheses that have been generated so far, i.e. Ĥ_t = {h_1, ..., h_t}. Because of the above duality,

\[
\hat\gamma_t := \min_{d}\, \max_{h \in \hat H_t} \gamma_h(d) \;\le\; \gamma^* = \rho^*.
\]
Thus, if ρ∗ > 0, then the equality constraints on the edges are eventually not satisfiable (i.e. when d is minimizing the above and γ̂_t > 0). In contrast, AdaBoost∗ is motivated by a system of inequality constraints Σ_{n=1}^N y_n d_n h_t(x_n) ≤ ϱ, where ϱ is adapted. Again, if ϱ < ρ∗, then the current system of inequalities may not have a solution (and the Lagrange multipliers may diverge to infinity).
In our new algorithm AdaBoost∗ we start with ϱ large and decrease it when necessary. Our choice of ϱ is always at least ρ∗ − ν, where ν is the precision parameter.

4.2 Convergence properties of AdaBoost_ϱ

We begin by analyzing a generalized version of AdaBoost_ϱ, where ϱ is not fixed but could be adapted in each iteration. This extension will be necessary for the later analysis of AdaBoost∗. We consider sequences {ϱ_t}_{t=1}^T, which might either be specified before or while running the algorithm. For instance, the algorithm Arc-GV (Breiman, 1999) chooses ϱ_t as min_{n=1,...,N} y_n f_{α^{t−1}}(x_n). It has been shown in Breiman (1999) that Arc-GV asymptotically converges to the maximum margin solution (see discussion in the next section). In the following we answer the question of how well AdaBoost_{ϱ_t} is able to increase the margin, and we bound the fraction of examples which have a margin smaller than, say, ρ. We start with the following useful lemma (Schapire et al., 1998, Schapire and Singer, 1999):

Lemma 2 For any convex combination f_α(x) = Σ_{t=1}^T (α_t / Σ_r α_r) h_t(x),

\[
\frac{1}{N}\sum_{n=1}^N I\big(y_n f_\alpha(x_n) \le \rho\big)
\;\le\; \left(\prod_{t=1}^T Z_t\right)\exp\Big\{\sum_{t=1}^T \rho\,\alpha_t\Big\}
\;=\; \prod_{t=1}^T \exp\{\rho\,\alpha_t + \log Z_t\}, \tag{3}
\]
where Z_t is as in step (3e) of AdaBoost_ϱ.

The proof directly follows from a simple extension of Theorem 1 in Schapire and Singer (1999) (see also Schapire et al. (1998)). This leads to a result generalizing Theorem 5 in Freund and Schapire (1997) for the case when the target margin is not zero.

Lemma 3 (Rätsch et al. (2001)) Let γ_t be the edge of h_t in the t-th step of AdaBoost_{ϱ_t}. Assume −1 ≤ ϱ_t ≤ γ_t. Then for all ρ ∈ [−1, 1],

\[
\exp\{\rho\,\alpha_t + \log Z_t\}
\;\le\; \exp\Big\{-\frac{1+\rho}{2}\log\frac{1+\varrho_t}{1+\gamma_t}
-\frac{1-\rho}{2}\log\frac{1-\varrho_t}{1-\gamma_t}\Big\}. \tag{4}
\]

The algorithm makes progress if the right hand side (rhs.) of (4) is smaller than one. Suppose we would like to reach a margin ρ on all training examples, where we obviously need to assume ρ ≤ ρ∗. Then the question arises which sequence {ϱ_t}_{t=1}^T one should use to find such a combined hypothesis in as few iterations as possible. Since the rhs. of (4) is minimized for ϱ_t = ρ (independent of γ_t), one should always use this constant choice. We therefore assume for the following analysis that ϱ is held constant in a run of AdaBoost_ϱ. Let us reconsider (4) of Lemma 3 for the special case ϱ_t = ϱ ≡ ρ:

\[
\exp\{\varrho\,\alpha_t + \log Z_t\} \;\le\; \exp\{-\Delta_2(\varrho,\gamma_t)\}, \tag{5}
\]

where Δ₂(ϱ, γ_t) := ((1 + ϱ)/2) log((1 + ϱ)/(1 + γ_t)) + ((1 − ϱ)/2) log((1 − ϱ)/(1 − γ_t)) is the binary relative entropy. This means that the rhs. of (3) is reduced by a factor of exactly exp(−Δ₂(ϱ, γ_t)), which can be bounded from above by inspecting the Taylor expansion of exp(−Δ₂(·, ·)) in the second argument and noticing that all higher order terms are negative:
\[
\exp(-\Delta_2(\varrho,\gamma_t)) \;\le\; 1 - \frac{1}{2}\,\frac{(\varrho-\gamma_t)^2}{1-\varrho^2}, \tag{6}
\]

where we need to assume γ_t ≥ ϱ and 0 ≤ ϱ ≤ 1. Note that this is the same form of bound one also obtains for the original AdaBoost (i.e. ϱ = 0), where the fraction of points with margin at most zero is reduced by a factor of 1 − (1/2)γ_t², if γ_t > 0. The additional factor 1/(1 − ϱ²) in (6) speeds up the convergence if ϱ > 0. Now we determine an upper bound on the number of iterations needed by AdaBoost_ϱ for achieving a margin of ϱ on all examples, given that the maximum margin is ρ∗:

Corollary 4 Assume the weak learner always achieves an edge γ_t ≥ ρ∗. If 0 ≤ ϱ ≤ ρ∗ − ν, ν > 0, then AdaBoost_ϱ will converge to a solution with margin of at least ϱ on all examples in at most ⌈2 log(N)(1 − ϱ²)/ν²⌉ + 1 steps.

Proof We use Lemma 2 and (6) and can bound

\[
\frac{1}{N}\sum_{n=1}^N I\big(y_n f(x_n) \le \varrho\big)
\;\le\; \prod_{t=1}^T \left(1 - \frac{1}{2}\,\frac{(\varrho-\gamma_t)^2}{1-\varrho^2}\right)
\;\le\; \left(1 - \frac{1}{2}\,\frac{\nu^2}{1-\varrho^2}\right)^{T}.
\]

The margin is at least ϱ for all examples if the rhs. is smaller than 1/N; hence after at most

\[
\left\lceil \frac{\log(N)}{-\log\big(1 - \frac{1}{2}\frac{\nu^2}{1-\varrho^2}\big)} \right\rceil + 1
\;\le\; \left\lceil \frac{2\log(N)(1-\varrho^2)}{\nu^2} \right\rceil + 1
\]

iterations, which proves the statement.
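The following is a small numeric sanity check of (6) and of the iteration bound of Corollary 4 (our own illustration; the chosen values of ϱ, γ_t, N and ν are arbitrary).

```python
import numpy as np

def delta2(rho, gamma):
    """Binary relative entropy Delta_2(rho, gamma) from Eq. (5)."""
    return 0.5 * (1 + rho) * np.log((1 + rho) / (1 + gamma)) \
         + 0.5 * (1 - rho) * np.log((1 - rho) / (1 - gamma))

# Eq. (6): exp(-Delta_2(rho, gamma)) <= 1 - (rho - gamma)^2 / (2 (1 - rho^2))
rho = 0.3
for gamma in np.linspace(rho + 1e-3, 0.999, 50):
    lhs = np.exp(-delta2(rho, gamma))
    rhs = 1 - (rho - gamma) ** 2 / (2 * (1 - rho ** 2))
    assert lhs <= rhs + 1e-12

# Corollary 4: number of iterations needed so that all margins reach rho
N, nu = 100, 0.1
print(int(np.ceil(2 * np.log(N) * (1 - rho ** 2) / nu ** 2)) + 1)
```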
A similar result is found for ϱ < 0 (Rätsch, 2001), but without the additional factor 1 − ϱ²; in that case the number of iterations is bounded by ⌈2 log(N)/ν²⌉ + 1.

4.3 Asymptotical Margin of AdaBoost_ϱ

With the methods shown so far, we can also analyze to what value the margin of the original AdaBoost algorithm converges asymptotically. First, we state a lower bound on the margin that is achieved by AdaBoost_ϱ. There is a gap between this lower bound and the upper bound of Theorem 1. In a second part we consider an experiment which shows that, depending on some subtle properties of the weak learner, the margin of combined hypotheses generated by AdaBoost can converge to quite different values (while the maximum margin is kept constant). We observe that the previously derived lower bound on the margin is almost tight in empirical cases.

As long as each factor in the rhs. of Eq. (3) is smaller than 1, the bound decreases. If each factor is at most 1 − μ (μ > 0), then the rhs. converges exponentially fast to zero. The following corollary considers the asymptotical case and gives a lower bound on the margin.

Corollary 5 (Rätsch (2001)) Assume AdaBoost_ϱ generates hypotheses h_1, h_2, ... with edges γ_1, γ_2, ... and coefficients α_1, α_2, .... Let γ = min_{t=1,2,...} γ_t and assume γ > ϱ. Furthermore, let ρ̂_t = min_{n=1,...,N} y_n Σ_{r=1}^t α_r h_r(x_n) / Σ_{r=1}^t α_r be the margin achieved in the t-th iteration and ρ̂ = sup_{t=1,2,...} ρ̂_t. Then the margin ρ̂ of the combined hypothesis is bounded from below by

\[
\hat\rho \;\ge\; \frac{\log(1-\varrho^2) - \log(1-\gamma^2)}{\log\frac{1+\gamma}{1-\gamma} - \log\frac{1+\varrho}{1-\varrho}}. \tag{7}
\]
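The bound (7) is easy to evaluate numerically; the snippet below (our own illustration) prints it for a few pairs (ϱ, γ) next to the quantity (γ + ϱ)/2 discussed in the following paragraph.

```python
import numpy as np

def margin_lower_bound(rho, gamma):
    """Asymptotic lower bound (7) on the margin achieved by AdaBoost_rho."""
    return (np.log(1 - rho ** 2) - np.log(1 - gamma ** 2)) / \
           (np.log((1 + gamma) / (1 - gamma)) - np.log((1 + rho) / (1 - rho)))

for rho, gamma in [(0.0, 0.2), (0.0, 0.5), (0.3, 0.5), (0.45, 0.5)]:
    b = margin_lower_bound(rho, gamma)
    print(rho, gamma, round(b, 3), round((gamma + rho) / 2, 3))
# in these examples the bound lies between (gamma + rho)/2 and gamma,
# and it approaches gamma as rho approaches gamma
```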
From (7) one can understand the interaction between ϱ and γ: if the difference between γ and ϱ is small, then the rhs. of (7) is small. Thus, if ϱ with ϱ ≤ γ is large, then ρ̂ must be large, i.e. choosing a larger ϱ results in a larger margin on the training examples. By using a Taylor expansion of the rhs. of (7), one sees that the margin is lower bounded by (γ + ϱ)/2. This known lower bound (Breiman, 1999, Theorem 7.2) is greater than ϱ if γ > ϱ.

However, in Section 4.1 we reasoned that ρ∗ ≥ γ. Thus if ϱ, the parameter of AdaBoost_ϱ, is chosen too small, then we guarantee only a suboptimal asymptotical margin. In the original formulation of AdaBoost we have ϱ = 0 and we guarantee only that AdaBoost_0 achieves a margin of at least (γ + ϱ)/2 = (1/2)γ∗. This gap in the theory motivates our AdaBoost∗ algorithm.

4.3.1 Experimental Illustration of Corollary 5

To illustrate the above-mentioned gap, we perform an experiment showing how tight (7) can be. We analyze two different settings: (i) the weak learner selects the hypothesis with largest edge over all hypotheses (i.e. the best case) and (ii) the hypothesis with minimal edge among all hypotheses with edge larger than ρ∗ (i.e. the worst case). Note that Corollary 5 holds for both cases, since the weak learner is allowed to return any hypothesis with edge larger than ρ∗.
Figure 1: Achieved margins of AdaBoost_ϱ using the best (green) and the worst (red) selection on random data for ϱ = 0 [left] and ϱ = 1/3 [right]: On the abscissa is the maximal achievable margin ρ∗ and on the ordinate the margin achieved by AdaBoost_ϱ for one data realization. For comparison the line y = x ("upper bound") and the bound (7) ("lower bound") are plotted. On the interval [ϱ, 1], there is a clear gap between the performance of the worst and best selection strategies. The margin of the worst strategy is tightly lower bounded by (7) and the best strategy has near maximal margin. If ϱ is chosen slightly below the maximum achievable margin then this gap is reduced to 0.
We use random data with N training examples, where N is drawn uniformly between 10 and 200. The labels are drawn at random from a binomial distribution with equal probability. We use a hypothesis set with 10⁴ random hypotheses with range {+1, −1}. We first choose a parameter p uniformly in (0, 1). Then the label of each hypothesis on each example is chosen to agree with the label of the example with probability p.⁵

5. We do not allow duplicate hypotheses or hypotheses that agree with the labels of all examples.

First we compute the solution ρ∗ of the margin-LP problem via the left hand side (lhs.) of (2).
Then we compute the combined hypothesis generated by AdaBoost_ϱ after 10⁴ iterations for ϱ = 0 and ϱ = 1/3, using the best and the worst selection strategy, respectively. The latter depends on ρ∗. We chose 300 hypothesis sets based on 300 random draws of p. The random choice of p ensures that there are cases with small and large optimal margins. For each hypothesis set we did two runs of AdaBoost_ϱ, using the best and the worst selection strategy. The result of each run is represented as a point in Figure 1. The abscissa is the maximal achievable margin ρ∗ for each run. The ordinate is the margin of AdaBoost_ϱ using the best (green) and the worst strategy (red).

We observe a great difference between the two selection strategies. Whereas the margin of the worst strategy is tightly lower bounded by (7), the best strategy has near maximal margin. These experiments show that one obtains different results by changing the selection strategy of the weak learning algorithm. Our lower bound holds for both selection strategies. The looseness of the bound is indeed a problem, as one is not able to predict where AdaBoost_ϱ is converging to.⁶ However, note that moving ϱ closer to ρ∗ reduces the gap (see also Figure 1 [right]).

6. One might even be able to construct cases where the outputs do not converge at all.

4.3.2 Decreasing the Step Size

Breiman (1999) conjectured that the inability of maximizing the margin is due to the fact that the normalized hypothesis coefficients may "circulate endlessly through the convex set" which is defined by the lower bound on the margin. In fact, motivated by our previous experiments, it seems possible to implement a weak learner that appropriately switches between optimal and worst case performance, leading to non-convergent normalized hypothesis coefficients. Rosset et al. (2002) have shown that AdaBoost with infinitesimally small step sizes may maximize the margin if the weak learner uses the best selection strategy (similar to what we found empirically for normal, i.e. finite, step sizes). This motivates us to analyze AdaBoost_ϱ with step sizes chosen as

\[
\hat\alpha_t = \eta\,\alpha_t = \frac{\eta}{2}\log\frac{1+\gamma_t}{1-\gamma_t} \;-\; \frac{\eta}{2}\log\frac{1+\varrho}{1-\varrho},
\]

for some η > 0. Obviously, for η = 1 we recover AdaBoost_ϱ. Following the same proof technique as for Corollary 5, we can show that in this case (under the same conditions as in Corollary 5)

\[
\hat\rho \;\ge\; \frac{-\log\big(\tfrac{1}{2}\big[(1+\gamma)\exp(-\hat\alpha) + (1-\gamma)\exp(\hat\alpha)\big]\big)}{\hat\alpha},
\]

where α̂ = (η/2) log((1 + γ)/(1 − γ)) − (η/2) log((1 + ϱ)/(1 − ϱ)). Note that if η goes to zero, then ρ̂ = γ (interestingly, this is independent of the choice of ϱ). Thus, if the weak learner always returns hypotheses with edges γ_t ≥ ρ∗ (t = 1, 2, ...), where ρ∗ is the maximum margin, then by the Min-Max Theorem the margin is maximized when η goes to zero. However, we have no guarantees on the convergence speed.
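Here is a small numeric check (our own, using the bound as reconstructed above) of the claim that the lower bound approaches γ as η → 0, independently of ϱ.

```python
import numpy as np

def step_size_bound(rho, gamma, eta):
    """Lower bound on the asymptotic margin for step sizes scaled by eta."""
    a = 0.5 * eta * (np.log((1 + gamma) / (1 - gamma))
                     - np.log((1 + rho) / (1 - rho)))
    Z = 0.5 * ((1 + gamma) * np.exp(-a) + (1 - gamma) * np.exp(a))
    return -np.log(Z) / a

rho, gamma = 0.0, 0.5
for eta in [1.0, 0.1, 0.01, 0.001]:
    print(eta, step_size_bound(rho, gamma, eta))
# as eta -> 0 the printed bound approaches gamma = 0.5; at eta = 1 it coincides
# with the value of (7) for the same rho and gamma
```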
4.4 Convergence of AdaBoost∗

The AdaBoost∗ algorithm is based on two insights:

• According to the discussion after Lemma 3, AdaBoost_ϱ converges fastest to a combined hypothesis with margin ρ∗ − ν when one chooses ϱ_t as close as possible to ρ∗ − ν.
• For distributions on the examples that are hard for the weak learner, the edge γ_t will be close to ρ∗.

The idea is that by choosing ϱ_t = (min_{r=1,...,t} γ_r) − ν we concentrate on the hardest distribution we have generated so far and can thus find a close over-estimate of ρ∗ − ν. This then helps to converge faster to a larger margin and also to generate distributions where the weak learner has to return low edges. Note that if the weak learner always returns hypotheses with edge γ_t = ρ∗ (the worst case in the sense of Section 4.3.1), then we immediately have ϱ_t = ρ∗ − ν and the same smallest step size is taken in every iteration (α_t is monotonically increasing in γ_t). The size of the step is given by the desired accuracy. This matches the intuition about decreasing the step size to achieve larger margins discussed in Section 4.3.2. We will now state and prove our main theorem:

Theorem 6 Assume the weak learner always achieves an edge γ_t ≥ ρ∗. Then AdaBoost∗ (Algorithm 2) will find a combined hypothesis f that maximizes the margin up to accuracy ν in at most ⌈2 log(N)/ν²⌉ + 1 calls of the weak learner. The final hypothesis combines at most ⌈2 log(N)/ν²⌉ + 1 base hypotheses.

Proof Let ρ = ρ∗ − ν be the margin that we would like to achieve. By assumption on the performance of the weak learner, we have γ_t ≥ ρ∗ and thus ρ = ρ∗ − ν ≤ γ_t − ν for t = 1, ..., T. By construction (step 3d in Algorithm 2), ϱ_t ≤ γ_t − ν and hence ρ ≤ ϱ_t ≤ γ_t − ν for each iteration. By Lemma 2 and Lemma 3,

\[
\frac{1}{N}\sum_{n=1}^N I\big(y_n f(x_n) \le \rho\big)
\;\le\; \prod_{t=1}^T \exp\Big\{-\frac{1+\rho}{2}\log\frac{1+\varrho_t}{1+\gamma_t}
-\frac{1-\rho}{2}\log\frac{1-\varrho_t}{1-\gamma_t}\Big\}.
\]

We now rewrite the rhs. using α_t = (1/2) log((1 + γ_t)/(1 − γ_t)) − (1/2) log((1 + ϱ_t)/(1 − ϱ_t)):

\[
= \prod_{t=1}^T \exp\Big\{-\frac{1}{2}\log\frac{1+\varrho_t}{1+\gamma_t}
-\frac{1}{2}\log\frac{1-\varrho_t}{1-\gamma_t} + \rho\,\alpha_t\Big\}.
\]

By (1), α_t ≥ 0 since ϱ_t ≤ γ_t. Thus by replacing ρ by its upper bound ϱ_t we get:

\[
\le\; \prod_{t=1}^T \exp\Big\{-\frac{1+\varrho_t}{2}\log\frac{1+\varrho_t}{1+\gamma_t}
-\frac{1-\varrho_t}{2}\log\frac{1-\varrho_t}{1-\gamma_t}\Big\}.
\]

Finally, by (6) we have:

\[
= \prod_{t=1}^T \exp(-\Delta_2(\varrho_t,\gamma_t)) \;\le\; \left(1 - \frac{1}{2}\nu^2\right)^{T}.
\]

The margin is at least ρ for all examples if the rhs. is smaller than 1/N; hence after at most

\[
\left\lceil \frac{\log(N)}{-\log\big(1 - \frac{1}{2}\nu^2\big)} \right\rceil + 1
\;\le\; \left\lceil \frac{2\log(N)}{\nu^2} \right\rceil + 1
\]

iterations, which proves the theorem.

Theorem 6 could be improved by a factor (1 − ϱ_t²) (in each iteration) if one assumes ϱ_t ≥ 0, since we have not exploited the additional factor 1/(1 − ϱ_t²) as in (6). Since ϱ_t ≥ ρ∗ − ν, one would obtain the bound ⌈2 log(N)(1 − (ρ∗ − ν)²)/ν²⌉ + 1, if ρ∗ ≥ ν. This factor will only matter for very large margins.
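To illustrate Theorem 6, the following self-contained sketch (our own construction: random ±1 hypotheses, a best-selection weak learner, and scipy for the LP of Eq. (2)) runs AdaBoost∗ for the number of iterations given by the theorem and checks that the achieved margin is at least ρ∗ − ν.

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(3)
N, K, nu = 50, 40, 0.1
y = rng.choice([-1, 1], size=N)
# each illustrative base hypothesis agrees with the label on roughly 70% of the examples
H = np.where(rng.random((N, K)) < 0.7, y[:, None], -y[:, None]).astype(float)

# maximum achievable margin rho* via the LP formulation of Eq. (2)
A = y[:, None] * H
res = linprog(np.r_[np.zeros(K), -1.0],
              A_ub=np.c_[-A, np.ones(N)], b_ub=np.zeros(N),
              A_eq=np.r_[np.ones(K), 0.0][None, :], b_eq=[1.0],
              bounds=[(0, None)] * K + [(-1, 1)])
rho_star = res.x[-1]

# AdaBoost* with a best-selection weak learner, so that gamma_t >= rho* holds
T = int(np.ceil(2 * np.log(N) / nu ** 2)) + 1   # iteration bound of Theorem 6
d = np.full(N, 1.0 / N)
alpha = np.zeros(K)
min_edge = 1.0
for t in range(T):
    edges = (d * y) @ H
    k = int(np.argmax(edges))
    gamma = edges[k]
    if abs(gamma) >= 1.0:
        alpha[:] = 0.0
        alpha[k] = np.sign(gamma)
        break
    min_edge = min(min_edge, gamma)
    rho_t = min_edge - nu                       # step 3d of Algorithm 2
    a = 0.5 * np.log((1 + gamma) / (1 - gamma)) \
        - 0.5 * np.log((1 + rho_t) / (1 - rho_t))
    alpha[k] += a
    d *= np.exp(-a * y * H[:, k])
    d /= d.sum()

achieved = np.min(y * (H @ (alpha / alpha.sum())))
print(rho_star, achieved, achieved >= rho_star - nu)   # the last value should be True
```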
4.5 Infinite Hypothesis Sets

So far we have implicitly assumed that the hypothesis space is finite. In this section we will show that this assumption is (often) not necessary. Also note that if the output of the hypotheses is discrete, the hypothesis space is effectively finite (Rätsch et al., 2002). For infinite hypothesis sets, Theorem 1 can be restated in a weaker form as:

Theorem 7 (Weak Min-Max, e.g. Nash and Sofer (1996))

\[
\gamma^* := \min_{d}\, \sup_{h \in H}\, \sum_{n=1}^N y_n h(x_n)\, d_n
\;\ge\; \sup_{\alpha}\, \min_{n=1,\dots,N}\, y_n \sum_{q:\,\alpha_q > 0} \alpha_q h_q(x_n) \;=:\; \rho^*, \tag{8}
\]
where d ∈ P^N and α ∈ P^{|H|} with finite support. We call Γ = γ∗ − ρ∗ the "duality gap". In particular, for any d ∈ P^N we have sup_{h∈H} Σ_{n=1}^N y_n h(x_n) d_n ≥ γ∗, and for any α ∈ P^{|H|} with finite support we have min_{n=1,...,N} y_n Σ_{q: α_q > 0} α_q h_q(x_n) ≤ ρ∗.

In theory the duality gap may be nonzero. However, Lemma 3 and Theorem 6 do not assume finite hypothesis sets and show that the margin will converge arbitrarily close to ρ∗, as long as the weak learning algorithm can return a hypothesis in each iteration that has an edge not smaller than ρ∗. In other words, the duality gap may result from the fact that the sup on the left side cannot be replaced by a max, i.e. there might not exist a single hypothesis h with edge larger or equal to ρ∗. By assuming that the weak learner is always able to pick good enough hypotheses (edge ≥ ρ∗) one automatically gets that Γ = 0 (by Lemma 3). Under certain conditions on H this maximum always exists and strong duality holds (for details see e.g. Rätsch et al., 2002, Rätsch, 2001, Hettich and Kortanek, 1993, Nash and Sofer, 1996):

Theorem 8 (Strong Min-Max) If the set of label vectors {(h(x_1), ..., h(x_N)) | h ∈ H} is compact, then Γ = 0.

In general, this requirement can be fulfilled by weak learning algorithms whose outputs depend continuously on the distribution d. Furthermore, the outputs of the hypotheses need to be bounded (cf. step 3a in AdaBoost_ϱ). The first requirement might be a problem with weak learning algorithms that are variants of decision stumps or decision trees. However, there is a simple trick to avoid this problem: Roughly speaking, at each point of discontinuity d̂, one adds to H all hypotheses that are limit points of L(S, d^s), where {d^s}_{s=1}^∞ is an arbitrary sequence converging to d̂, and L(S, d) denotes the hypothesis returned by the weak learning algorithm for weighting d and training sample S (Rätsch, 2001). This procedure makes H closed.
5. Experimental Illustration of Marginal AdaBoost

First of all, we would like to note that we are aware of the fact that maximizing the margin of the ensemble does not in all cases lead to an improved generalization performance. For fairly noisy data sets even the opposite has been reported (cf. Quinlan, 1996, Breiman, 1999, Grove and Schuurmans, 1998, Rätsch et al., 2001). Also, Breiman (1998) reported an example where the margins of all examples are larger in one ensemble than in another and the latter generalized considerably better. There are theoretical bounds on the generalization error of linear classifiers that improve with the margin. Hence, we expect that one should be able to measure differences in the generalization error, if one function approximately maximizes the margin while another function does not (i.e. has a small margin). Similar results have been obtained in Schapire et al. (1998) on a multi-class optical character recognition problem.
Figure 2: The two discriminative dimensions of our separable one hundred dimensional data set.
Here we report experiments on artificial data to (a) illustrate how our algorithm works and (b) show how it compares to AdaBoost. Our data is 100 dimensional and contains 98 nuisance dimensions with uniform noise. The other two dimensions are shown in Figure 2. For training we use only 100 examples, so there is obviously a need to carefully control the capacity of the ensemble. As the weak learning algorithm we use C4.5 decision trees provided by Quinlan (1992), using an option to control the number of nodes in the tree. We have set it such that C4.5 generates trees with about three nodes. Otherwise, the weak learner often classifies all training examples correctly and already over-fits the data. Furthermore, since in this case the margin is already maximal (equal to 1), boosting algorithms would stop since γ = 1. We therefore need to limit the complexity of the weak learner, in good agreement with the bounds on the generalization error (Schapire et al., 1998).

Moreover, we have to deal with the fact that C4.5 cannot use weighted samples. We therefore use weighted bootstrapping. However, this amplifies the problem that the resulting hypotheses might in some cases have a rather small edge (smaller than the maximal margin), which should not happen (according to the Min-Max-Theorem) if the weak learner performs optimally. We deal with this problem by repeatedly calling C4.5 if the edge is smaller than the margin of the current linear combination. Furthermore, for AdaBoost∗, a small edge of one hypothesis can spoil the margin estimate ϱ_t. We reduce this problem by resetting ϱ_t = ρ_t + ν whenever ϱ_t ≤ ρ_t, where ρ_t is the margin of the currently combined hypothesis.

In Figure 3 we see a typical run of AdaBoost, Marginal AdaBoost, AdaBoost∗ and Arc-GV for ν = 0.1. For comparison we plot the margins of the hypotheses generated by AdaBoost (cf. Figure 3 (left)). One observes that it is not able to achieve a large margin efficiently (ρ = 0.37 after 1000 iterations).
Marginal AdaBoost as proposed in Rätsch and Warmuth (2002) proceeds in stages and first tries to find an estimate of the margin using a binary search. It calls AdaBoost_ϱ three times. The first call of AdaBoost_ϱ, for ϱ = 0, already stops after four iterations, since it has generated a consistent combined hypothesis. The lower bound l on ρ∗ as computed by our algorithm is l = 0.07 and the upper bound u is 0.94. The second time, ϱ is chosen to be in the middle of the interval [l, u] and AdaBoost_ϱ reaches the margin of ϱ = 0.51 after 80 iterations. The interval is now [0.51, 0.77]. Since the length of the interval, u − l = 0.27, is small enough, Marginal AdaBoost leaves the loop through an exit condition, calls AdaBoost_ϱ a last time for ϱ = l − ν = 0.41 and finally achieves a margin of ρ = 0.55. In a run of Arc-GV for one thousand iterations we observe a margin of the combined hypothesis of 0.53, while for our new algorithm, AdaBoost∗, we find 0.58. In this case the margin for AdaBoost∗ is the largest among all methods after one thousand iterations. It starts with slightly lower margins in the beginning, but then catches up due to the better choice of the margin estimate.
Figure 3: Illustration of the achieved margin of AdaBoost_0 (left), Marginal AdaBoost (middle), Arc-GV, and AdaBoost∗ (right) at each iteration. Marginal AdaBoost calls AdaBoost_ϱ three times while adapting ϱ (dash-dotted). We also plot the values for l and u as in Marginal AdaBoost (dashed). (For details see Rätsch and Warmuth, 2002.) AdaBoost∗ achieves larger margins than AdaBoost. Compared to Arc-GV it starts slower, but then catches up in the later iterations. Here the correct choice of the parameter ϱ is important.
            C4.5           AdaBoost        Marginal AdaBoost   AdaBoost∗
  E_gen     7.4 ± 0.11%    4.0 ± 0.11%     3.6 ± 0.10%         3.5 ± 0.10%
  ρ         —              0.31 ± 0.01     0.58 ± 0.01         0.55 ± 0.01

Table 1: Estimated generalization performances and margins with confidence intervals for decision trees (C4.5), AdaBoost, Marginal AdaBoost and AdaBoost∗ on the toy data. All numbers are averaged over 200 splits into 100 training and 19900 test examples.
In Table 1 we see the average performances of the four classifiers. For AdaBoost and AdaBoost∗ we combined 200 hypotheses for the final prediction. For Marginal AdaBoost we use ν = 0.1 and let the algorithm combine only 200 hypotheses for the final prediction, to get a fairer comparison. We see a large improvement of all ensemble methods over the single classifier. There is also a slight but significant difference (according to a t-test with confidence level 98%) between the generalization performances of AdaBoost and Marginal AdaBoost, as well as between AdaBoost and AdaBoost∗. Note also that the margins of the combined hypothesis achieved by Marginal AdaBoost and AdaBoost∗ are on average almost twice as large as for AdaBoost. The difference in generalization performance between AdaBoost∗ and Marginal AdaBoost is not statistically significant.
The differences between the achieved margins of both algorithms seem slightly significant (96%). The slightly larger margins generated by Marginal AdaBoost can be attributed to the fact that it uses many more calls to the weak learner than AdaBoost∗ and, after an estimate of the achievable margin is available, it starts optimizing the linear combination using this estimate. It would be natural to use a two-pass algorithm: In the first pass use AdaBoost∗ to get an estimate ϱ of the margin in the interval [ρ∗ − ν, ρ∗] and then use this estimate in a run of AdaBoost_ϱ. The hypothesis produced in the second pass should be of better quality, i.e. it should have a larger margin and use fewer base hypotheses.
6. Conclusion

We have analyzed a generalized version of AdaBoost in the context of large margin algorithms. From von Neumann's Min-Max theorem, the maximal achievable margin ρ∗ is at least γ if the weak learner always returns a hypothesis with weighted classification error less than 1/2 − (1/2)γ. The asymptotical analysis led us to a lower bound on the margin of the combined hypotheses generated by AdaBoost_ϱ in the limit, which was shown to be rather tight in empirical cases. Our results indicate that AdaBoost generally does not maximize the margin, but achieves a reasonably large margin. To overcome these problems we proposed an algorithm for which we have shown fast convergence to the maximum margin solution. This is achieved by decreasing ϱ iteratively, such that the gap between the best and the worst case becomes arbitrarily small. In our analysis we did not need to assume additional properties of the weak learning algorithm. In a simulation experiment we have illustrated the validity of our analysis.
References

K.P. Bennett, A. Demiriz, and J. Shawe-Taylor. A column generation algorithm for boosting. In P. Langley, editor, Proceedings, 17th ICML, pages 65–72, San Francisco, 2000. Morgan Kaufmann.

L. Breiman. Are margins relevant in voting? Talk at the NIPS'98 workshop on Large Margin Classifiers, December 1998.

L. Breiman. Prediction games and arcing algorithms. Neural Computation, 11(7):1493–1518, 1999. Also Technical Report 504, Statistics Department, University of California Berkeley.

Y. Freund. Boosting a weak learning algorithm by majority. Information and Computation, 121(2):256–285, September 1995.

Y. Freund and R.E. Schapire. Experiments with a new boosting algorithm. In Proc. 13th International Conference on Machine Learning, pages 148–156. Morgan Kaufmann, 1996.

Y. Freund and R.E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119–139, 1997.

Y. Freund and R.E. Schapire. Adaptive game playing using multiplicative weights. Games and Economic Behavior, 29:79–103, 1999.

A.J. Grove and D. Schuurmans. Boosting in the limit: Maximizing the margin of learned ensembles. In Proceedings of the Fifteenth National Conference on Artificial Intelligence, 1998.

R. Hettich and K.O. Kortanek. Semi-infinite programming: Theory, methods and applications. SIAM Review, 3:380–429, September 1993.

J. Kivinen and M. Warmuth. Boosting as entropy projection. In Proc. 12th Annu. Conference on Comput. Learning Theory, pages 134–144. ACM Press, New York, NY, 1999.

V. Koltchinskii, D. Panchenko, and F. Lozano. Some new bounds on the generalization error of combined classifiers. In Advances in Neural Information Processing Systems, volume 13, 2001.

J. Lafferty. Additive models, boosting, and inference for generalized divergences. In Proc. 12th Annu. Conf. on Comput. Learning Theory, pages 125–133, New York, NY, 1999. ACM Press.

S. Nash and A. Sofer. Linear and Nonlinear Programming. McGraw-Hill, New York, NY, 1996.

J.R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1992.

J.R. Quinlan. Boosting first-order learning. Lecture Notes in Computer Science, 1160:143, 1996.

G. Rätsch. Robust Boosting via Convex Optimization. PhD thesis, University of Potsdam, Neues Palais 10, 14469 Potsdam, Germany, October 2001.

G. Rätsch, A. Demiriz, and K. Bennett. Sparse regression ensembles in infinite and finite hypothesis spaces. Machine Learning, 48(1-3):193–221, 2002. Special Issue on New Methods for Model Selection and Model Combination. Also NeuroCOLT2 Technical Report NC-TR-2000-085.

G. Rätsch, T. Onoda, and K.-R. Müller. Soft margins for AdaBoost. Machine Learning, 42(3):287–320, March 2001. Also NeuroCOLT Technical Report NC-TR-1998-021.

G. Rätsch and M.K. Warmuth. Maximizing the margin with boosting. In Proc. COLT, volume 2375 of LNAI, pages 319–333, Sydney, 2002. Springer.

S. Rosset, J. Zhu, and T. Hastie. Boosting as a regularized path to a maximum margin separator. Technical report, Department of Statistics, Stanford University, 2002.

R.E. Schapire. The Design and Analysis of Efficient Learning Algorithms. PhD thesis, MIT Press, 1992.

R.E. Schapire, Y. Freund, P.L. Bartlett, and W.S. Lee. Boosting the margin: A new explanation for the effectiveness of voting methods. The Annals of Statistics, 26(5):1651–1686, October 1998.

R.E. Schapire and Y. Singer. Improved boosting algorithms using confidence-rated predictions. Machine Learning, 37(3):297–336, December 1999. Also Proceedings of the 14th Workshop on Computational Learning Theory 1998, pages 80–91.

J. von Neumann. Zur Theorie der Gesellschaftsspiele. Math. Ann., 100:295–320, 1928.

T. Zhang. Sequential greedy approximation for certain convex optimization problems. Technical report, IBM T.J. Watson Research Center, 2002.