CONVEX OPTIMIZATION IN DATA SCIENCE SEMINAR

HON-LEUNG LEE

The sources of this note include the textbooks [1] and [2]. The former is accessible to a general audience, while the latter is for readers interested in convex analysis. The theory of support vector machines can be found in any statistical learning text or in the UW Stat 535 notes. One remark comes from the lectures of Emily Fox’s big data class. Coordinate descent for LASSO can be found in [3]. As the author is busy, the pictures drawn during the talk are pending.

1. Basic notions about convexity

Consider a function f : Rn → R ∪ {+∞}. It is proper if it is not identically +∞. It is convex if the epigraph of f, epi(f) = {(x, α) : f(x) ≤ α}, is convex in Rn+1. It is closed (or lower semi-continuous, or lsc) if the epigraph of f is closed. We define the domain of f to be dom(f) = {x ∈ Rn : f(x) < +∞}. Most of the functions we consider below are proper, convex and closed.

The next important definition is a notion of derivative for convex functions.

Definition 1.1. Given a proper convex function f : Rn → (−∞, +∞], a vector g ∈ Rn is said to be a subgradient of f at x0 ∈ Rn if f(x) ≥ f(x0) + g>(x − x0) for all x ∈ Rn. The collection of all such g’s is called the subdifferential of f at x0, denoted by ∂f(x0). Notice that if f(x0) = +∞ then ∂f(x0) = ∅.

For the map f(x) = |x| on R, we have

(1.1)    ∂f(x0) = {1}       if x0 > 0
                  [−1, 1]   if x0 = 0
                  {−1}      if x0 < 0.

Notice that the subdifferential coincides with the gradient when the function is smooth.

Proposition 1.2 (Differentiable functions). If f is convex and differentiable at x0 ∈ int dom(f), then ∂f(x0) = {∇f(x0)}.

The following is the analogue of the Fermat rule for smooth functions.

Proposition 1.3 (Fermat rule). If f is convex, then x0 ∈ dom(f) is a minimizer of f (that is, f(x) ≥ f(x0) for any x) if and only if 0 ∈ ∂f(x0).

Date: April 30, 2015.


2. Convex programs and Lagrangian duality

Let f0, f1, · · · , fm : Rn → R ∪ {+∞} be proper convex functions. Let D := dom(f0) ∩ · · · ∩ dom(fm) be the convex domain. To make sure the subdifferential sum rule can be applied later, we assume int dom(f1) ∩ · · · ∩ int dom(fm) ≠ ∅. Consider the primal problem

(P)    p∗ := inf_{x∈D} f0(x)    subject to fi(x) ≤ 0, i = 1, · · · , m.

Any point in D that satisfies all constraints is called (primal) feasible. The set of feasible points is clearly convex. Define the Lagrangian L : D × Rm → R given by

    L(x, λ) := f0(x) + Σ_{i=1}^m λi fi(x).

Consider the concave dual function g : Rm → R ∪ {−∞} given by

    g(λ) = inf_{x∈D} L(x, λ).

Then the dual problem is given by

(D)    d∗ := sup_{λ∈dom(g)} g(λ)    subject to λ ≥ 0.

The following result is very easy to prove.

Theorem 2.1 (Weak duality). One has d∗ ≤ p∗.

Equality holds above if the Slater condition holds for (P), that is, there is x ∈ int(D) such that x is primal feasible and x attains strict inequality in every non-affine inequality constraint. This result can be proved by a standard separating hyperplane argument; see the references for details.

Theorem 2.2 (Strong duality and dual attainment). If the Slater condition holds for (P), then p∗ = d∗, and d∗ is attained if d∗ is finite.

3. Karush-Kuhn-Tucker (KKT) conditions

Definition 3.1. We say (x, λ) satisfies the KKT conditions if we have
• (Primal feasibility) fi(x) ≤ 0 for all i.
• (Dual feasibility) λ ≥ 0.
• (Complementary slackness) λi fi(x) = 0 for all i.
• (Stationarity) 0 ∈ ∂f0(x) + Σ_{i=1}^m λi ∂fi(x). Notice that stationarity is equivalent to 0 ∈ ∂[L(·, λ)](x).

For convex programs satisfying the Slater condition, primal and dual optimality are completely certified by the KKT conditions:

Theorem 3.2 (Optimality is equivalent to KKT). If (x∗, λ∗) satisfies the KKT conditions then (x∗, λ∗) is a pair of primal-dual solutions (for (P) and (D)). The converse holds if strong duality holds. In either case, x∗ minimizes the Lagrangian with λ∗ fixed.

In the following we illustrate how the above theory plays a role in data science / machine learning.
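To see Theorems 2.1, 2.2 and 3.2 in action, here is a small numeric sketch (the toy program, grids and tolerances are ours, not from the notes): minimize f0(x) = x² subject to f1(x) = 1 − x ≤ 0, for which the Slater condition holds (x = 2 is strictly feasible).

```python
# Toy check of weak/strong duality and the KKT conditions for
# minimize f0(x) = x^2 subject to f1(x) = 1 - x <= 0.

def L(x, lam):          # Lagrangian L(x, lambda) = f0(x) + lambda * f1(x)
    return x**2 + lam * (1 - x)

def g(lam):             # dual function: inf_x L(x, lam), attained at x = lam/2
    x = lam / 2.0
    return L(x, lam)

# Grid search for the primal and dual optimal values.
xs = [k / 1000.0 for k in range(-3000, 3001)]
p_star = min(x**2 for x in xs if 1 - x <= 0)          # attained at x = 1
d_star = max(g(k / 1000.0) for k in range(0, 5001))   # attained at lambda = 2

assert d_star <= p_star + 1e-9        # weak duality (Theorem 2.1)
assert abs(p_star - d_star) < 1e-6    # strong duality under Slater (Theorem 2.2)

# KKT at (x*, lambda*) = (1, 2): feasibility, slackness, stationarity.
x_opt, lam_opt = 1.0, 2.0
assert 1 - x_opt <= 0 and lam_opt >= 0        # primal/dual feasibility
assert abs(lam_opt * (1 - x_opt)) < 1e-9      # complementary slackness
assert abs(2 * x_opt - lam_opt) < 1e-9        # 0 = f0'(x*) + lambda* f1'(x*)
```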


4. Support vector machines (SVM) for binary classification

4.1. Motivation and primal C-SVM. Let {(xi, yi)}_{i=1}^N be the training data, where xi ∈ Rp and yi ∈ {+1, −1}. First assume the inputs xi are linearly separable. Consider the linear predictor f(x) = fw,b(x) = w>x + b, where w ∈ Rp \ {0} and b ∈ R. Our goal is to find a separating hyperplane that maximizes the margin, defined as min_i dist(xi, f−1(0)). In other words the optimization problem is

(4.1)    max_{w≠0, b∈R} min_i dist(xi, f−1(0))    s.t. yi f(xi) ≥ 1 for any i.

Notice we want the margin to be positive, hence we can impose yi f(xi) ≥ ε instead of yi f(xi) > 0; by scaling w and b we can put down yi f(xi) ≥ 1. Notice that dist(xi, f−1(0)) = |f(xi)|/‖w‖, and we assume for any w, b there is an active constraint, that is, f(xi) = ±1 for some i. Hence (4.1) is equivalent to

(SVM)    min_{w∈Rp, b∈R} (1/2)‖w‖₂²    s.t. yi f(xi) ≥ 1 for any i.

At the optimum of (SVM), the points xi corresponding to active constraints are called support vectors. The problem (SVM) is not practical because
• It is not robust: the removal of an outlier can cause a big change in the decision boundary f = 0. Because of this, (SVM) is also called hard-margin SVM.
• It does not work if the data is not linearly separable. One remedy is to map the current feature space to a high-dimensional feature space by a map φ until the images of the data become almost linearly separable.
To address the above two problems (and to prevent overfitting), we can use C-SVM (or soft-margin SVM) instead. Note C-SVM introduces slacks to allow misclassification or margin errors, but these are penalized in the objective function (with parameter C > 0, usually chosen by cross-validation):

(C-SVM)    min_{w∈Rp, b∈R, ξ∈RN} (1/2)‖w‖₂² + C Σ_{i=1}^N ξi    s.t. yi f(xi) ≥ 1 − ξi for any i,
                                                                     ξi ≥ 0 for any i.

If C is small then the optimal w∗ is small, so the margin is large, the “band” is wide, and mistakes are lightly penalized. This may cause underfitting. If C is large then mistakes are heavily penalized, so the “band” is narrow, and the effect is the same as hard-margin SVM.

4.2. Dual C-SVM and KKT conditions. Clearly (C-SVM) is a primal convex program; now we figure out its dual. Let the αi and µi be the Lagrange multipliers of the first and second set of constraints in (C-SVM) respectively. Then the Lagrangian is

    L(w, b, ξ, α, µ) = (1/2)‖w‖₂² + C Σ_{i=1}^N ξi + Σ_{i=1}^N αi [1 − ξi − yi(w>xi + b)] − Σ_{i=1}^N µi ξi.


Some of the KKT conditions are

(4.2)    α ≥ 0, µ ≥ 0

(4.3)    ∂L/∂w = 0 ⇐⇒ w = Σ_{i=1}^N αi yi xi

(4.4)    ∂L/∂b = 0 ⇐⇒ Σ_{i=1}^N αi yi = 0

(4.5)    ∂L/∂ξ = 0 ⇐⇒ α + µ = C1.

Using (4.3), (4.4) and (4.5), one has the dual function

    g(α, µ) = (1/2)‖Σ_{i=1}^N αi yi xi‖₂² + Σ_{i=1}^N αi − Σ_{i=1}^N αi yi (Σ_{j=1}^N αj yj xj)> xi
            = −(1/2) Σ_{i=1}^N Σ_{j=1}^N yi yj (xi)> xj αi αj + Σ_{i=1}^N αi
            = −(1/2) α>Gα + 1>α,

where Gij := yi yj (xi)> xj is the (signed) Gram matrix. Note g does not involve µ. Incorporating this dual function with the related conditions (4.2), (4.4) and (4.5), we see the dual problem is the convex problem

    min_{α, µ ∈ RN} (1/2) α>Gα − 1>α    s.t. Σ_{i=1}^N αi yi = 0,
                                             α + µ = C1,
                                             α, µ ≥ 0,

which is equivalent to the quadratic program

(Dual C-SVM)    min_{α∈RN} (1/2) α>Gα − 1>α    s.t. Σ_{i=1}^N αi yi = 0,
                                                    0 ≤ α ≤ C1.

From the above dual problem we draw these conclusions.
• Strong duality holds.
• Once the dual optimizer α∗ is acquired, we can recover the primal one by (4.3).
• The Lagrange multiplier vector α helps us classify the performance of the predictor f. All of the following comes from complementary slackness.
• If αi∗ > 0 then yi f∗(xi) = 1 − ξi∗.
• If αi∗ < C then µi∗ > 0, so ξi∗ = 0.
• Hence if 0 < αi∗ < C, then yi f∗(xi) = 1 and xi is a support vector; this is where we can compute b∗.
• If yi f∗(xi) > 1 then yi f∗(xi) > 1 − ξi∗, so αi∗ = 0. The converse usually holds. Thus correct classification corresponds to αi∗ = 0.


• If yi f∗(xi) < 1 then 1 − ξi∗ ≤ yi f∗(xi) < 1, so ξi∗ > 0, hence µi∗ = 0 and αi∗ = C. The converse usually holds. Thus a margin error (0 < yi f∗(xi) < 1) or a misclassification (yi f∗(xi) < 0) corresponds to αi∗ = C.
To sum up, there are at least three reasons why we study the dual problem:
• Lagrange multipliers are meaningful; they help us understand the structure of the problem.
• New algorithms and methods arise from the dual problem.
• Solving the dual problem may be more efficient than solving the primal.

4.3. Algorithms for solving C-SVM. For the primal problem, notice that if yi f(xi) ≥ 1, then ξi should be pushed to zero for minimization; if yi f(xi) < 1 then we require ξi ≥ 1 − yi f(xi) > 0, and for minimization we expect ξi = 1 − yi f(xi). Combining the two cases, ξi = [1 − yi f(xi)]+. Thus (C-SVM) is equivalent to the following unconstrained convex problem

    min_{w∈Rp, b∈R} (1/2)‖w‖₂² + C Σ_{i=1}^N [1 − yi f(xi)]+,

which can be rewritten as

    min_{w∈Rp, b∈R} Σ_{i=1}^N Lh(yi, f(xi)) + (1/(2C))‖w‖₂²,

where Lh(y, y′) = (1 − yy′)+ is called the hinge loss. Hence C-SVM is equivalent to L2-regularized hinge loss minimization. Therefore the primal C-SVM can be solved by (stochastic) (sub)gradient descent.

The dual problem can be solved by the projected gradient method: first compute the gradient of the dual objective, project it onto the subspace {α : Σ_{i=1}^N αi yi = 0}, and figure out a step size so that the dual objective value decreases and the update stays in the box [0, C]N.

5. Least absolute shrinkage and selection operator (LASSO)

For supervised learning where the dimension of the feature space is much greater than the size of the training set, and where feature selection is also desired, l1 regularization is a way to go. For example, consider the problem

(5.1)    min_{β∈Rp} f(β) := (1/2)‖y − Xβ‖₂² + λ‖β‖₁

given λ > 0, y ∈ RN and X ∈ RN×p. One can apply (stochastic) coordinate descent to solve this problem.

Algorithm 5.1 (Coordinate descent for problem (5.1)). Initialize β and repeat until convergence: pick one coordinate i ∈ {1, · · · , p} (in “round-robin” order, at random, or otherwise) and update

    βi = arg min_{b∈R} f(β1, · · · , βi−1, b, βi+1, · · · , βp).

Remark 5.2. Some remarks:


(1) To initialize β, one can take the univariate regression coefficients of each feature.
(2) Pass over all the coordinates several times: i = 1, · · · , p, 1, · · · , p, 1, · · · , p, · · · .
(3) The problem is scalable, in the sense that one can split the coordinates into groups, assign them to different processors, and let them run the coordinate updates in parallel. However, one needs to take care of the “interference” between the βi.

We are going to write down a closed-form formula for the above βi. Before that, for any x ∈ R we consider

(5.2)    Sλ(x) := arg min_{z∈R} (1/2)|z − x|² + λ|z|.

People call this the prox-operator of the closed convex function λ|·|. The infimum value in (5.2) is called the Moreau envelope of λ|·|, which is the infimal convolution of the scaled 1-norm and the halved squared 2-norm. Notice that

(5.3)    z = Sλ(x) if and only if 0 ∈ z − x + λ∂|z|.

Then, by (1.1), one has

    x = z + λ       if z > 0
    x ∈ [−λ, λ]     if z = 0
    x = z − λ       if z < 0

and hence

    Sλ(x) = x − λ   if x > λ
            0       if −λ ≤ x ≤ λ
            x + λ   if x < −λ.
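The closed-form Sλ above translates directly into code; a minimal sketch:

```python
# The prox-operator of lam * |.|: shrink x toward 0 by lam, clipping at 0.

def soft_threshold(x, lam):
    if x > lam:
        return x - lam
    if x < -lam:
        return x + lam
    return 0.0

assert soft_threshold(3.0, 1.0) == 2.0    # x > lambda: shift down by lambda
assert soft_threshold(-3.0, 1.0) == -2.0  # x < -lambda: shift up by lambda
assert soft_threshold(0.5, 1.0) == 0.0    # |x| <= lambda: threshold to zero
```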

The operator Sλ(·) is called soft-thresholding at level λ. Now, let the columns of X be X1, · · · , Xp. Fix i and keep the βj, j ≠ i, fixed. Let X−i be X without column i, and β−i be β without βi. We need to minimize over βi the quantity

    (1/2)‖y − Σ_{k=1}^p Xk βk‖₂² + λ|βi|.

By the Fermat rule we have

    0 ∈ Xi>(Xβ − y) + λ∂|βi| = Xi>Xi βi + Xi>(X−i β−i − y) + λ∂|βi|.

Dividing by ‖Xi‖₂², one has

    0 ∈ βi + Xi>(X−i β−i − y)/‖Xi‖₂² + (λ/‖Xi‖₂²) ∂|βi| = βi − mi + λi ∂|βi|,

where

    mi := Xi>(y − X−i β−i)/‖Xi‖₂²    and    λi := λ/‖Xi‖₂².

By (5.3), we know βi = Sλi(mi).
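Putting Algorithm 5.1 together with the update βi = Sλi(mi), here is a minimal sketch in code (the toy data and sweep count are ours, not from the notes):

```python
import numpy as np

# Coordinate descent for (5.1) with the closed-form update beta_i = S_{lambda_i}(m_i).

def soft_threshold(x, lam):
    return np.sign(x) * max(abs(x) - lam, 0.0)

def lasso_cd(X, y, lam, sweeps=100):
    N, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0)                  # ||X_i||_2^2 for each column
    for _ in range(sweeps):
        for i in range(p):                         # round-robin over coordinates
            r = y - X @ beta + X[:, i] * beta[i]   # residual excluding X_i beta_i
            m_i = X[:, i] @ r / col_sq[i]
            beta[i] = soft_threshold(m_i, lam / col_sq[i])
    return beta

# Toy check: y depends only on the first feature; the l1 penalty should
# zero out the irrelevant coefficients exactly.
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 3))
y = 3.0 * X[:, 0]
beta = lasso_cd(X, y, lam=1.0)
assert abs(beta[0]) > 1.0                  # the true feature keeps a large coefficient
assert beta[1] == 0.0 and beta[2] == 0.0   # irrelevant features are selected out
```

Note the exact zeros: soft-thresholding returns 0.0 whenever |mi| ≤ λi, which is how the LASSO performs feature selection.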


References

[1] S. Boyd and L. Vandenberghe, Convex Optimization, Cambridge University Press (2004).
[2] J.M. Borwein and A.S. Lewis, Convex Analysis and Nonlinear Optimization: Theory and Examples, Vol. 3, Springer Science & Business Media (2010).
[3] J. Friedman, T. Hastie, H. Höfling and R. Tibshirani, Pathwise coordinate optimization, The Annals of Applied Statistics 1.2 (2007): 302–332.