Multiplicative Approximation Algorithms for Generalized Covering and Packing Problems
Jonathan A. Wagner
Multiplicative Approximation Algorithms for Generalized Covering and Packing Problems
Research Thesis
Submitted in partial fulfillment of the requirements for the degree of Master of Science in Computer Science
Jonathan A. Wagner
Submitted to the Senate of the Technion — Israel Institute of Technology Tammuz 5776 Haifa July 2016
The research thesis was done under the supervision of Prof. Elad Hazan in the Computer Science Department.
I would like to thank with all my heart my advisor Prof. Elad Hazan for giving me the opportunity of working under his supervision and allowing me to witness and to participate in academic research done by world-class computer scientists. Thank you for the guidance, for the one-on-one sessions, for the fun, and for caring for my own academic success from the beginning. In addition I want to thank all the people who supported me and helped me during the different stages of my studies: I want to thank Prof. Yuval Yishay for all the times he did his best to help me, Prof. Erez Petrank for his help, Prof. Roi Reichart for the interesting talks during my studies, and Prof. Shmuel Zaks for his support, flexibility and care. I want to thank Prof. Nati Srebro for his in-depth consideration of my work and for all his remarks, and I also want to thank all the other Technion faculty who have helped me during my studies. I want to thank the administrative staff of the faculty of the Technion for their help. I want to thank the students from Prof. Hazan's group who helped and advised me - Oren Anava, Kfir Levi, Tomer Koren and Dan Garber. In addition I want to thank my Computer Science officemates Yuri Meshman and Itay Feirewerker, and also Gil Eizinger, Ido Cohen, and Rami Band. In addition I want to thank the rabbis of the Technion - Rabbi Zinni and Rabbi Elad Dukov - and the religious community of the Technion for all their support. I want to thank Jonathan Fromm for proofreading parts of this thesis. I want to thank Hadas Sofer for pushing me to finish this thesis. I want to thank my sister Noa and brother Itamar for being there and for their support. Finally, I want to thank my parents for being loving, caring and supporting parents.
The generous financial support of the Technion is gratefully acknowledged.
Contents

Abstract

1 Introduction
  1.1 Preliminaries
  1.2 Previous Work
  1.3 Statement of Results
    1.3.1 Comparison with the 'Sublinear Perceptron' approach
    1.3.2 Runtime comparison

2 Basic Frameworks for multiplicative approximation
  2.1 Online Scaling for Generalized Covering and Generalized Packing Problems
    2.1.1 Packing
    2.1.2 Covering
  2.2 Online Scaling for general Min-Max Problems
    2.2.1 Packing Dual (or Covering Primal) Online Scaling
    2.2.2 Packing Primal (or Covering Dual) Online Scaling
    2.2.3 Primal-Dual Online Scaling
  2.3 Domain Shrinkage for Packing Problems

3 Speeding-up the Framework
  3.1 Recursive ℓ1-sampling
  3.2 Amortized Complexity
  3.3 Analysis of the Sped-up Framework

4 Applications
  4.1 Normalized Covering SDP
  4.2 Non-negative Linear Classifier

5 Lower Bounds
  5.1 Lower Bounds for Zero Sum Games
  5.2 Lower Bounds for Non-Negative Linear Classifier

6 Appendix
  6.0.1 Regret of the MWU Algorithm
  6.0.2 A More Detailed Analysis of the Recursive ℓ1-Sampling
  6.0.3 ℓ1-Sampling and Amortised Complexity
  6.0.4 Analysis of the Sped-up Framework for Covering
  6.0.5 Auxiliary Lemma for Multiplicative Approximation
Abstract

Approximation of min-max problems is a common task in convex optimization which has many important uses in machine learning and combinatorial optimization. Approximations of problems can be divided into two main categories - additive approximations and multiplicative approximations. Additive approximations are usually relevant to general settings, but have runtime that in many cases depends significantly on the magnitude of natural parameters of the problem. Multiplicative approximations, on the other hand, are natural for non-negative settings, and unlike the additive case, allow in many cases algorithms that are independent of the magnitude, or width, of the input parameters. This property is also known as width-free running time. Multiplicative approximation can also be useful if the optimum value of the problem is very small. In this case a multiplicative approximation of even 1/2 gives a very small additive approximation, which may require a much larger running time from additive approximation methods. Recently, for the case of additive approximation, the use of sampling methods together with low-regret algorithms enabled the development of a general method for approximating an important class of min-max problems. This led to remarkably fast algorithms for several important problems in machine learning. This approach, however, did not address the task of multiplicative approximation. In this work we present simple schemes based on low-regret algorithms that give width-independent multiplicative approximation algorithms for two important classes of non-negative min-max problems - generalized covering and generalized packing. Our main contribution is a novel sampling and speed-up technique that in certain cases can be incorporated into the schemes and lead to very fast algorithms. As an application, we describe the first near-linear time, width-free multiplicative approximation algorithms for Normalized Covering Semi-definite Programming and for Non-negative Linear Classifier.
Chapter 1
Introduction

The min-max problem of a non-negative convex-concave function is that of finding x ∈ X and y ∈ Y that achieve

    min_{x∈X} max_{y∈Y} g(x, y)                                                        (1.1)
where g : X × Y → R+ is a convex-concave non-negative function, and X ⊂ R^n, Y ⊂ R^m are convex sets. Two important subclasses of the above formulation are the generalized packing and generalized covering problems. These formulations include non-negative linear programming and positive semi-definite programming and have varied applications in machine learning and combinatorial optimization. Examples include the problems of K-Nearest Neighbor Classification [11], MAXCUT [8], Undirected Sparsest Cut [11], minimum-cost multicommodity flow [18], network embeddings [18], the Held and Karp bound for the traveling salesman problem [18], and many more.
In general convex optimization, approximations can be divided into two main categories - additive approximations and multiplicative approximations. Additive approximations are usually relevant to less restricted settings, and have runtime that may depend on the magnitude of the gradients or other natural parameters of the problem. Multiplicative approximations, on the other hand, are natural for non-negative settings, and allow in many cases algorithms that are independent of the magnitude, or width, of the input parameters. This property is also known as width-free running time.
This latter width-free property can be extremely useful in two settings of interest. First, if the range of the input parameters is very large: many additive approximation methods will run in time polynomially (usually quadratically) proportional to this range, whereas width-free methods are invariant to the magnitude of the input parameters. Second, if the optimum value of the entire formulation is very small, a multiplicative approximation of 1/2, say, gives a very small additive approximation which may require a much larger running time from additive approximation methods.
Recently, [5] developed a general method for additively approximating an important class of min-max problems. The key idea behind the method was a coupling of low-regret algorithms from online convex optimization with sampling techniques. This approach proved to be very effective, as special cases and subsequent developments gave remarkably fast algorithms for several important problems in machine learning such as training a linear classifier, SDP, and linear SVM. This approach, however, could not be directly applied to obtain width-free multiplicative approximations.
In this work we present general frameworks for multiplicative approximation algorithms for covering and packing problems based on variations of the multiplicative weights method. Our main contribution is a novel sampling and speed-up technique that in certain cases can be coupled with the frameworks and lead to very fast algorithms. As an application, we describe the first near-linear time, width-free multiplicative approximation algorithms for Normalized Covering Semi-definite Programming, and for Non-negative Linear Classifier.
1.1
Preliminaries
Generalized Covering and Packing Problems

This work deals with several problems which belong to the general form of solving the min-max problem of a non-negative convex-concave function, as follows:

    min_{x∈X} max_{y∈Y} g(x, y)                                                        (1.2)

where g : X × Y → R+ is a convex-concave non-negative function, X ⊂ R^n and Y ⊂ R^m. For this problem we will focus on finding a multiplicative approximation: given in addition an approximation parameter ε ∈ (0, 1),

    find x̄ ∈ X, ȳ ∈ Y  s.t.  (1 − ε) max_{y∈Y} g(x̄, y) ≤ λ* ≤ (1/(1 − ε)) min_{x∈X} g(x, ȳ)      (1.3)

where λ* denotes the optimum min_{x∈X} max_{y∈Y} g(x, y).
We note that in this work we always assume that min_{x∈X} max_{y∈Y} g(x, y) = max_{y∈Y} min_{x∈X} g(x, y). Two important special cases of the above formulation are the generalized packing and generalized covering problems.
The generalized packing problem, which is sometimes referred to as the min-max resource sharing problem (e.g. in [13]), is the following problem:

    min_{x∈K} max_{i∈[m]} f_i(x)                                                       (1.4)

where K ⊂ R^n is a convex set and f_i is a non-negative convex function over K for all i ∈ [m] (we denote [m] := {1, ..., m}). The corresponding approximation problem is:

    find x ∈ K  s.t.  max_{i∈[m]} f_i(x) ≤ (1 + ε) min_{x'∈K} max_{i∈[m]} f_i(x')       (1.5)

where ε is a non-negative approximation parameter.
Let us denote by f(x) the vector (f_i(x))_{i=1}^m, and for any q ∈ R^m denote q^T f(x) = Σ_{i=1}^m q_i f_i(x). At first glance problem 1.4 might not seem a special case of the above non-negative min-max problem, but replacing max_{i∈[m]} f_i(x) with max_{p∈∆_m} p^T f(x), where ∆_m = {p ∈ R^m | Σ_{i=1}^m p_i = 1, ∀i ∈ [m] : p_i ≥ 0}, we get an equivalent problem which is a special case of the above non-negative min-max problem. The equivalent formulation is therefore:

    min_{x∈K} max_{p∈∆_m} p^T f(x)                                                     (1.6)

and the approximation version:

    find x ∈ K  s.t.  max_{p∈∆_m} p^T f(x) ≤ (1 + ε) min_{x'∈K} max_{p∈∆_m} p^T f(x')   (1.7)

The generalized covering problem, which is sometimes referred to as the max-min resource sharing problem (e.g. in [13]), is the following problem:

    max_{x∈K} min_{i∈[m]} f_i(x)                                                       (1.8)

where K ⊂ R^n is a convex set and f_i is a non-negative concave function over K for all i ∈ [m]. The corresponding approximation problem is:

    find x ∈ K  s.t.  min_{i∈[m]} f_i(x) ≥ (1 − ε) max_{x'∈K} min_{i∈[m]} f_i(x')       (1.9)

where ε is a non-negative approximation parameter. Again, by replacing min_{i∈[m]} f_i(x) with min_{p∈∆_m} p^T f(x) we get an equivalent problem which is a special case of the general non-negative min-max problem. The equivalent formulation is therefore:

    max_{x∈K} min_{p∈∆_m} p^T f(x)                                                     (1.10)

and the approximation version:

    find x ∈ K  s.t.  min_{p∈∆_m} p^T f(x) ≥ (1 − ε) max_{x'∈K} min_{p∈∆_m} p^T f(x')   (1.11)

We note that in this work we may switch seamlessly between optimizing over the discrete set [m] and optimizing over the convex set ∆_m, when applicable, as done in this section. We also note that in this work we always assume that max_{x∈K} min_{p∈∆_m} p^T f(x) = min_{p∈∆_m} max_{x∈K} p^T f(x), and may use this equality when needed. It is important to note, though, that it is not always possible to switch between optimizing over [m] and optimizing over ∆_m; for example, min_{p∈∆_m} max_{x∈K} p^T f(x) is not necessarily equal to min_{i∈[m]} max_{x∈K} f_i(x).
Some Other Notations

Some other notations we use are the following: for v ∈ R^m and E ⊂ [m] we denote by v|⁰_E the vector in R^m such that v|⁰_E(i) = v(i) for i ∈ E, and v|⁰_E(i) = 0 for i ∉ E. For a function f over some domain K we will denote by argmax^{1−ε}_{x'∈K} f(x') some x ∈ K for which f(x) ≥ (1 − ε) max_{x'∈K} f(x').
We will now present the special cases of generalised covering and generalised packing that we will consider.

Covering SDP Problems

For Semidefinite Programming (SDP) we will use the following notations. We will denote S^n_+ = {X ∈ R^{n×n} | X ⪰ 0}, and for two matrices A, B ∈ R^{n×n} we will denote A • B := Tr(A^T B), where Tr stands for the trace of a matrix. We will define a Normalized Covering SDP problem as follows:

    min_{X∈S^n_+} Tr(X)
    s.t. A_i • X ≥ 1   ∀i = 1, ..., m                                                  (1.12)

where ∀i : 0_{n×n} ≠ A_i ⪰ 0. In [12] and [17] the algorithms proposed for general covering SDP are first converted to this formulation and then solved. Notice this formulation may be referred to in the literature as Normalized Positive SDP (as in [17]).
The following problem is a special case of the generalised covering problem, which we will call Maxmin Positive SDP:

    max_{X∈K} min_{i∈[m]} A_i • X,    K = {X ∈ S^n_+ | Tr(X) ≤ 1}                      (1.13)

where A_i for i = 1, ..., m are positive semidefinite matrices in R^{n×n}. The Normalized Covering SDP can be reduced to the Maxmin Positive SDP as stated in the following claim:

Claim 1.1.1. Denote by λ*_NC the optimum of the Normalized Covering SDP and by λ*_MP the optimum of the Maxmin Positive SDP. If λ*_MP = 0 then the Normalized Covering SDP has no feasible solution (λ*_NC = ∞); otherwise λ*_NC = 1/λ*_MP. In the case λ*_MP > 0, if X_MP ∈ K is a (1 − ε)-multiplicative approximation to the Maxmin Positive SDP, that is, min_{i∈[m]} A_i • X_MP = α ≥ (1 − ε) λ*_MP > 0, then (1/α) X_MP is a 1/(1 − ε)-multiplicative approximation to the Normalized Covering SDP problem.
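The reduction in Claim 1.1.1 is mechanical. The following Python sketch (written for dense numpy matrices, with a hypothetical maxmin_positive_sdp_solver standing in for any (1 − ε)-approximate Maxmin Positive SDP oracle, e.g. Algorithm 11 below) illustrates how an approximate Maxmin Positive SDP solution is rescaled into an approximate Normalized Covering SDP solution.

import numpy as np

def solve_normalized_covering_sdp(As, maxmin_positive_sdp_solver, eps=0.1):
    """Reduce Normalized Covering SDP to Maxmin Positive SDP (Claim 1.1.1).

    As -- list of PSD matrices A_1, ..., A_m
    maxmin_positive_sdp_solver -- hypothetical oracle returning X in
        K = {X PSD, Tr(X) <= 1} with min_i A_i . X >= (1 - eps) * lambda*_MP
    """
    X_mp = maxmin_positive_sdp_solver(As, eps)
    # alpha = min_i A_i . X_mp, the value achieved by the approximate solution
    alpha = min(float(np.sum(A * X_mp)) for A in As)
    if alpha <= 0:
        return None  # lambda*_MP = 0: the covering SDP is infeasible
    # Scaling by 1/alpha makes every constraint A_i . X >= 1 hold, and
    # Tr(X_mp / alpha) <= 1/alpha <= 1 / ((1 - eps) * lambda*_MP).
    return X_mp / alpha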
Non-negative linear classifier

In this problem we are given points {a_i}_{i=1}^m in the positive orthant, and we wish to find a hyperplane that passes through the origin which maximizes the minimal distance from the points. Formally, we have the problem:

    max_{x∈B_n} min_{i∈[m]} a_i^T x                                                    (1.14)

where a_i ∈ R^n_+ for i = 1, ..., m and B_n denotes the Euclidean unit ball in R^n. It will be convenient for us to think of the a_i^T as rows of a matrix A and to formulate our problem in the following way:

    max_{x∈B_n} min_{p∈∆_m} p^T A x                                                    (1.15)

where A ∈ R^{m×n}_+.
We finish this section by presenting auxiliary techniques and lemmas that we will use in this work.
The Multiplicative Weights Update Method

We now present the Multiplicative Weights algorithm; a review of this method can be found in [3]. Notice that this algorithm is an Online Convex Optimization algorithm, which also means that it does not obtain all its input at once but rather iteratively receives input and performs calculations. The usage of this algorithm (and its variations) as a building block in subsequent algorithms in this work is a bit uncommon: every time the algorithm is called, it actually performs another iteration with the new input. To understand the way Algorithm 1 and its variations are used by other algorithms, it may be helpful to think of it as an object with internal constants (e.g. in Alg. 1 the numbers η, m) and internal variables (e.g. the vectors p and w), which has an initialization method and a "step" method. When called for the first time (without any input) the "step" method just outputs variables (e.g. p) obtained by the initialization (e.g. line 2). Following this, every time the "step" method is called with an input v ∈ [−1, 1]^m, it updates its internal variables (e.g. lines 5 and 6) and outputs some variable (e.g. p). For the case of Alg. 1, the value of the internal variable p when it is output for the t-th time can be thought of as p_t. Notice that the t-th output is generated before observing the t-th input v_t.
Algorithm 1 Multiplicative Weights Algorithm (with costs vectors)
1: Parameters: m - number of experts, η < 1/2
2: Initialization (t = 0): w_1 ← 1_m, p_1 ← (1/m) 1_m
3: for t = 1, 2, ..., T do
4:   Obtain costs vector v_t ∈ [−1, 1]^m
5:   ∀i ∈ [m] : w_{t+1}(i) ← w_t(i)(1 − η v_t(i))
6:   p_{t+1} ← w_{t+1} / ‖w_{t+1}‖_1
7: end for
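As a concrete illustration of the "object with a step method" view described above, here is a minimal Python sketch of Algorithm 1 (the class and method names are mine, not from the thesis); the t-th call to step returns p_t before seeing the cost vector v_t, which is then fed to the next call.

import numpy as np

class MultiplicativeWeights:
    """Multiplicative Weights with cost vectors (sketch of Algorithm 1)."""

    def __init__(self, m, eta):
        assert 0 < eta < 0.5
        self.eta = eta
        self.w = np.ones(m)          # internal weights w_t
        self.p = self.w / m          # internal distribution p_t

    def step(self, v=None):
        """Update with the previous cost vector (if any) and return the next p."""
        if v is not None:
            self.w *= (1.0 - self.eta * v)   # w_{t+1}(i) = w_t(i)(1 - eta v_t(i))
            self.p = self.w / self.w.sum()   # p_{t+1} = w_{t+1} / ||w_{t+1}||_1
        return self.p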
For this algorithm the following regret bound can be shown:

Lemma 1.1.1. (Corollary 2.2 in [3]) For any p ∈ ∆_m it holds that

    Σ_{t=1}^T p_t^T v_t ≤ Σ_{t=1}^T p^T (v_t + η|v_t|) + (ln m)/η

where |v_t| is the vector obtained by taking the coordinate-wise absolute value of v_t.

From the above lemma we can directly obtain the following:

Lemma 1.1.2. Assume that all v_t are non-negative cost vectors. Then

    Σ_{t=1}^T p_t^T v_t ≤ (1 + η) min_{p∈∆_m} Σ_{t=1}^T p^T v_t + (ln m)/η
In case the vectors v_t represent gains which we wish to maximise, we can run Algorithm 1 with cost vectors −v_t. For this version of the algorithm, which we will refer to as the Multiplicative Weights algorithm with gains vectors, we get the following regret bound:

Lemma 1.1.3. (Corollary 2.6 in [3]) For any p ∈ ∆_m it holds that

    Σ_{t=1}^T p_t^T v_t ≥ Σ_{t=1}^T p^T (v_t − η|v_t|) − (ln m)/η

where |v_t| is the vector obtained by taking the coordinate-wise absolute value of v_t.

From the above lemma we can directly obtain the following:

Lemma 1.1.4. Assume that all v_t are non-negative gain vectors. Then

    (1 − η) max_{p∈∆_m} Σ_{t=1}^T p^T v_t ≤ Σ_{t=1}^T p_t^T v_t + (ln m)/η
Combining Multiplicative Approximations

The following is an algebraic lemma which is useful in combining multiplicative approximations:

Lemma 1.1.5. Assume that

    ∀i ∈ [n] : (1 − ε_i) A_{i−1} ≤ (1 + η_i) A_i + α_i

where ∀i ∈ [n] : α_i ≥ 0, ε_i ∈ [0, 1), η_i ∈ [0, 1), ε_i + η_i < 1, and ∀i ∈ [n] ∪ {0} : A_i > 0. Then for any k ∈ [n] ∪ {0}:

    A_n / A_0 ≥ 1 − ( Σ_{i=1}^n ε_i + Σ_{i=1}^n η_i + (Σ_{i=1}^n α_i) / A_k )

And if also Σ_{i=1}^n ε_i + Σ_{i=1}^n η_i + (Σ_{i=1}^n α_i) / A_k
pt+1 ←
wt+1 |0E
t+1
kwt+1 |0E
t+1
9:
k1
end for
15
1 . mm
Lemma 2.1.1. Assume that ET +1 6= ∅, then the following holds:
T P
pt T vt ≤ (1+η) min
Technion - Computer Science Department - M.Sc. Thesis MSC-2016-09 - 2016
t=1 log m η
T P
x∈∆m t=1
p T vt +
(the regret bound is exactly the same as in the standard MW algorithm).
The proof of the above lemma could be found in the appendix. The following theorem guarantees correctness of the general framework and gives a bound on the number of iterations. Its proof follows the same lines as the proof for the fast version (theorem 3.0.3 ).
Theorem 2.1.2. The Online Scaling Algorithm for Generalized Covering returns a 1 − O() m log m iterations. multiplicative approximate solution in O 2
2.2
Online Scaling for general Min-Max Problems
In this section we present templates for multiplicative approximation of general non-negative minmax problems. Before presenting the templates we mention that in this section we will refer to the regret of OCO algorithms in a slightly modified form, which will be easier to deal with in the context of multiplicative approximations. We will say that R(T ),regret are the regret parameters of an OCO Online-Algcosts if: (1 − regret )
T X
ft (xt ) ≤ min x∈X
t=1
T X
ft (x) + R(T )
t=1
We will say that R(T ),regret are the regret parameters of an OCO Online-Alggains if: (1 − regret ) max x∈X
T X
ft (x) ≤
t=1
T X
ft (xt ) + R(T )
t=1
Notice that for all previous OCO bounds we can just choose R(T ) to be the original bound and regret to be zero. On the other hand, for the MW algorithm we can choose regret = η and R(T ) = logη n .
2.2.1
Packing Dual (or Covering Primal) Online Scaling
We will begin by introducing the general Packing Dual Online Scaling template. 16
Technion - Computer Science Department - M.Sc. Thesis MSC-2016-09 - 2016
Algorithm 5 Packing Dual Online Scaling 1: Input: non-negative convex-concave function g : X × Y → R+ , where X ⊂ Rn and Y ⊂ Rm are convex. Approximation parameter ∈ [0, 1). 2: for t = 0...T − 1 do 3: Let yt+1 ← Online-Alggains (wt · g(xt , y)) 1 1− Let xt+1 ← arg minx∈X g(x, yt+1 ) 5: Set wt+1 > 0 6: end for P P w y w x 7: Return x ¯ = Pt wt t t and y¯ = Pt wt t t
4:
t
t
Note: Notice that wt must be chosen so that wt · g(xt , y) is a legal input for the Online-Alggains . We now present a theorem which can be used to guarantee correctness and number of iterations. Theorem 2.2.1. Let R(T ) and regret ∈ [0, 1) be the regret parameters of Online-Alggains . Then
min g(x, y¯) ≥ max g(¯ x, y) 1 − − regret − x∈X y∈Y
maxy∈Y
R(T ) T P wt · g(xt , y) t=1
Notice that in order to guarantee a multiplicative approximation we need to complete the R(T ) template such that regret + is smaller than O(). This will also typically imply T P wt ·g(xt ,y)
maxy∈Y
t=1
the number of iterations. Proof. First, for any y 0 ∈ Y the function g(x, y 0 ) is convex in x and so
T P
wt g(xt , y 0 ) ≥
t=1
T P t=1
P T
wt xt
wt g t=1 T P t=1
T P
wt g(¯ x, y 0 ). Since this holds for all y 0 we get:
t=1 T X
! wt
t=1
max g(¯ x, y) ≤ max y∈Y
y∈Y
T X
wt g(xt , y)
(2.4)
t=1
Second from the regret of the Online-Alggains algorithm we get that: (1 − regret ) max y∈Y
T X
wt g(xt , y) ≤
t=1
T X
wt g(xt , yt ) + R(T )
(2.5)
t=1
Third, from the choice of xt each iteration we get that ∀t ∈ [T ] : g(xt , yt ) ≤ (1+) minx∈X g(x, yt ), and so, T T X X wt g(xt , yt ) ≤ (1 + ) wt min g(x, yt ) (2.6) t=1
t=1
17
x∈X
wt
, y0 =
Technion - Computer Science Department - M.Sc. Thesis MSC-2016-09 - 2016
and from basic algebra laws and positivity of wt : T X
wt min g(x, yt ) ≤ min x∈X
t=1
x∈X
T X
wt g(x, yt )
(2.7)
t=1
Last, since for any x0 ∈ X the function g(x0 , y) is concave in y, we get in a similar manner to the way we obtained 2.4 that: min x∈X
T X
wt g(x, yt ) ≤
t=1
T X
! wt
t=1
min g(x, y¯)
(2.8)
x∈X
Combining all inequalities (2.4, 2.5, 2.6, 2.7, 2.8) according to the auxiliary algebraic lemma 1.1.5 gives us:
min g(x, y¯) ≥ max g(¯ x, y) 1 − − regret − x∈X y∈Y
maxy∈Y
R(T ) T P wt · g(xt , y) t=1
2.2.2
Packing Primal (or Covering Dual) Online Scaling
The following is the general Packing Primal (or Covering Dual) Online Scaling template. Algorithm 6 Packing Primal Online Scaling 1: for t = 0...T − 1 do 2: Let xt+1 ← Online-Algcosts (wt · g(x, yt )) 3: Let yt+1 ← arg max1− y∈Y g(xt+1 , y) 4: Set wt+1 > 0 5: end for P P w x w y 6: Return x ¯ = Pt wt t t and y¯ = Pt wt t t t
t
Note: Notice that wt must be chosen so that wt · g(x, yt ) is a legal input for the Online-Algcosts . We now present a theorem which can be used to guarantee correctness and number of iterations. Theorem 2.2.2. Let R(T ),regret be the regret parameters of Online-Algcosts . Then
min g(x, y¯) ≥ λ∗ 1 − − regret − x∈X
minx∈X
R(T ) T P wt · g(x, yt ) t=1
Notice that in order to guarantee a multiplicative approximation we need to complete the 18
template such that regret + minx∈X
R(T ) T P wt ·g(x,yt )
is smaller than O(). This will also typically imply
Technion - Computer Science Department - M.Sc. Thesis MSC-2016-09 - 2016
t=1
the number of iterations. Notice that instead of predetermining T sometimes it may be useful to continue the iterations until indeed this condition holds (as done for example in algorithm 3). The proof for this theorem resembles the proof of 2.2.1, it is omitted and could be found in the appendix.
2.2.3
Primal - Dual Online Scaling
The Online Scaling primal-dual template for non-negative min-max problems is the following: Algorithm 7 Primal-Dual Online Scaling 1: for t = 0...T − 1 do 2: Let yt+1 ← Online-Alggains (wt · g(xt , y)) 3: Let xt+1 ← Online-Algcosts (wt · g(x, yt )) 4: Set wt+1 > 0 5: end for P P w x w y 6: Return x ¯ = Pt wt t t and y¯ = Pt wt t t t
t
Note: Notice that wt must be chosen so that wt · g(xt , y) is a legal input for the Online-Alggains and also wt · g(x, yt ) is a legal input for the Online-Algcosts . The following theorem can be used to guarantee correctness and number of iterations. The proof can be found in the appendix. Theorem 2.2.3. Let Rcosts (T ),regret-costs be the regret parameters of Online-Algcosts , and let Rgains (T ),regret-gains be the regret parameters of Online-Alggains . Then:
Rcosts (T ) + Rgains (T ) min g(x, y¯) ≥ max g(¯ x, y) 1 − regret-costs − regret-gains − T x∈X y∈Y P maxy∈Y wt · g(xt , y) t=1
Moreover, the inequality remains correct when we replace maxy∈Y
T P
wt ·g(xt , y) with minx∈X
t=1
g(x, yt ) or with
T P
T P
wt ·
t=1
wt · g(xt , yt )
t=1
As in the previous templates in order to guarantee a multiplicative approximation we need to R (T )+Rgains (T ) (or the expression that complete the template such that regret-costs + regret-gains + costs P T wt ·g(xt ,y)
maxy∈Y
t=1
is chosen to replace this as stated in the theorem) is smaller than O(). This will also typically imply the number of iterations. Notice that instead of predetermining T sometimes it may be useful to continue the iterations until indeed this condition holds (as done for example in algorithm 3). 19
Technion - Computer Science Department - M.Sc. Thesis MSC-2016-09 - 2016
2.3
Domain Shrinkage for Packing Problems
In this section we describe a different approach for obtaining a multiplicative approximation. In this approach the algorithm is exactly a simple dual algorithm for obtaining the additive approximation with one change - we shrink the domain of the dual function from the simplex ∆m to the simplexα alpha- ∆αm := {x ∈ ∆m , xi ≥ m }. The multiplicative nature of the approximation in this approach is based on two principles: First, roughly speaking, shrinking the domain of a concave function affects the maximum only in a multiplicative manner (this claim is stated more formally in claim 2.3.1). Second, limiting the dual function (f Dual (p) = minx∈K pT f (x)) to the simplex-alpha bounds the norm of the gradients of the dual function by a constant that is a multiplicative the optimum λ∗ . We now present the Domain Shrinkage framework for generalised packing:
Algorithm 8 Domain Shrinkage Framework for Generalized Packing 1: Input: m non-negative convex functions fi , over a convex domain K. Approximation parameter ∈ (0, 12 ). 4m log m 2: Set T ← d 3 T e, α ← , ηMW ← 3: for t = 0, ..., T − 1 do 4: Let pt+1 ← MW-step(pt , xt ) over ∆αm 5: Let xt+1 ← arg minx∈K g(pt+1 , x). 6: end for P 7: Return x ¯ = T1 t xt .
Note: The primal minimisation, arg minx∈K g(pt+1 , x), can be replaced with a multiplicative approximated minimisation, and the new framework can obtain similar results, using slight modifications of the proofs. Notice that since we changed the domain for which the MW is applied to, we had to modify the MW algorithm to fit the new domain. The modified MW algorithm is presented next in Algorithm 9. Notice the algorithm is formulated to already handle gain vectors with an arbitrary bound G on their norm rather than the usual bound of 1. We will note that the Domain Shrinkage framework assumes knowledge of λ∗ since MW∆αm uses G which depends on λ∗ . 20
Algorithm 9 MW algorithm over ∆αn for gains with bounded `∞ norm
Technion - Computer Science Department - M.Sc. Thesis MSC-2016-09 - 2016
1: 2: 3: 4: 5: 6: 7:
1 Input: m - number of experts, G - bound on the `∞ norm of the gain vectors, α ∈ (0, m ], η < 1 Set p1 = p˜1 = m m , w1 = 1m , for t = 1...T do Guess pt Incur gains vector ft ft wt+1 ← wt eη G (entry-wise) t+1 p˜t+1 ← |wwt+1 |1
1 2
1 pt+1 ← (1 − α)˜ pt+1 + α m m 9: end for
8:
The following lemma states that despite the modification of the algorithm, when the gains are non-negative the regret bound is the same. Lemma 2.3.1. For non-negative gain vectors with bound G on the infinity norm, the MW algorithm over ∆αn achieves: Regret (T ) = maxα p∈∆n
T X t=1
pT ft −
T X
pt T ft ≤ η maxα
t=1
p∈∆n
T X
pT ft +
t=1
G log m η
We now state the main theorem regarding the correctness and runtime of the Domain Shrinkage framework: Theorem 2.3.1. The Domain Shrinkage Framework for Packing returns a 1 + O()-multiplicative m log m approximate solution in O iterations. 3 In order to prove this theorem we will need the following two lemmas. Lemma 2.3.2. Consider the modified problem in which p ∈ ∆αn = {x ∈ ∆n , xi ≥ αn }, i.e. min maxα pT f (x) = µ∗ = maxα min pT f (x) p∈∆n x∈K
x∈K p∈∆n
Then: (1 − α)λ∗ ≤ µ∗ ≤ λ∗ Definition 2.3.1. For p ∈ ∆m we define xp = arg minx∈K pT f (x). We will note that f (xp ) ∈ ∇p f Dual (p) where f Dual (p) = minx∈K pT f (x). Lemma 2.3.3. Let p ∈ ∆αm . Then ∀i ∈ [m] : fi (xp ) ≤ kf (xp )k1 ≤
λ∗ m α .
The proofs of these lemmas appear after the proof of the theorem. Proof. (Theorem 2.3.1) Let G∞ be a bound on the `∞ norm of the gradients given to the MW∆αm algorithm, and notice ∗ that from lemma 2.3.3 we can define G∞ = λ αm . Now, 21
Technion - Computer Science Department - M.Sc. Thesis MSC-2016-09 - 2016
= maxp∈∆m pT , f (¯ x)
max fi (¯ x) i
≤
1 1−α
x) maxp∈∆αm pT f (¯
claim 2.3.1
T P pT f (xt ) maxp∈∆αm t=1 T P T G∞ log m 1 1 1 ≤ 1−α T 1−η pt f (xt ) + η t=1 1 1 1 ∗ + G∞ log m ≤ 1−α T µ T 1−η η 1 1 1 ∗ + G∞ log m ≤ 1−α T λ T 1−η η
≤
≤
1 1 1−α T
1 1 ∗ 1−α 1−η λ 1 ∗ = (1−) 2λ
+ +
1 λ∗ m log m 1 1−α 1−η αηT λ∗ m log m 1 (1−)2 2 T
≤ λ∗ 1 + 6 +
4m log m 2 T
≤ λ∗ (1 + 7)
defn of x ¯ and convexity of f regret (lemma 2.3.1) defn of xt lemma 2.3.2 lemma 2.3.3 choice of α and η
0. 2: Perform pre-processing for the recursive `1 -sampling 7m log m e, MW (gains) Parameters: m, η ← 3: T ← d 2 4: for t = 0...T − 1 do 5: Let pt+1 ← MW-step (vt ) T 6: Let xt+1 ← arg min1+ x∈K pt+1 f (x) 1 7: ωt+1 ← kf (xt+1 )k1 8: vt ← ωt · recursive `1 -sample (f (xt )) 9: end for P ω x 10: Return x ¯ = Pt ωt t t . t
For this framework we have the following result: Theorem 3.0.2. The Online Scaling Framework for Packing with recursive `1 sampling (Algorithm m log m 10) runs O iterations, and with probability at least 3/4 returns a (1 + O())-multiplicative 2 approximation solution. then algorithm 10 can be implemented In the case fi arelinear functions, mn log2 m m log m operations plus O minimizations of a linear function over K to run in O 2 2
Algorithm 11 Online Scaling for Covering with Recursive `1 Sampling 1: Input: m non-negative concave functions fi , over domain K. Approximation parameter . 2: Perform pre-processing for the recursive `1 -sampling 9 log m 3: K ← 2 , t ← 0, MWU Parameters: m, U ← 2K, η ← . t P 4: while mini vs (i) < K do s=1
5: 6: 7: 8:
pt+1 , Et+1 ← MWU -step(vt ) T xt+1 ← arg max1− x∈K pt+1 f (x) Update the recursive `1 -sampling data-structure with ∆E equal Et \ Et+1 ωt+1 ← kf |0 1(x )k Et+1
t+1
1
vt+1 ← ωt+1 · recursive `1 -sample(f (xt+1 )|0Et+1 ) 10: t←t+1 11: end while P 9:
12:
Return x ¯=
ωs xs P ωs .
s≤t
s≤t
For this framework we have the following result: m iterations, and with probTheorem 3.0.3. Algorithm 11 always ends after at most O m log 2 ability at least 3/4 returns a (1 − O())-multiplicative approximation solution. In the case fi are 25
Technion - Computer Science Department - M.Sc. Thesis MSC-2016-09 - 2016
linear functions, then algorithm 11 can be implemented to run in O m maximizations of a linear function over K O m log 2
mn log2 m 2
operations plus
We now introduce the speed-up techniques used in these frameworks. The proofs of theorems 3.0.2 and 3.0.3 is discussed in section 3.3.
3.1
Recursive `1 -sampling
We define the `1 -sampling of a non-negative vector u ∈ Rm to be the random variable v ∈ Rm ui which has the following distribution: ∀i ∈ [m] : v = ei · kuk1 w.p. kuk . 1 m Suppose we are given some explicit linear function h(x) : K → R+ , a series of inputs x1 ,x2 ,...,xT , and we would like to generate a corresponding series of `1 -samplings of h(xt ) for t = 1,2,...,T . Naively this would take O(T nm) operations since for each xt we need to calculate h(xt ) in order to generate its `1 -sampling, and this costs O(nm) operations per input. In this section we will show that by using preprocessing we can generate the desired series of `1 -samplings using only a total of O (nm + T n log m) operations. In order to describe the algorithm we first introduce a few definitions. Definitions: • For a range of s numbers I = {k, k + 1, ..., k + s − 1}, we will define Split(I) to be the two ranges of numbers obtained by splitting I ”in the middle” - k, .., k + d 2s e − 1 and k + d 2s e, ..., k + s − 1 (When I consists of only one number Split(I) = I). • We will define I to be all the ranges we get by applying the Split function on [m] and continuing recursively. Formally, denote the set Um = {{1} , {2} , ..., {m}}, and construct I in the following way: 1. Initialize: I1 = {[m]} , i = 1. 2. While Ii 6= Um do: Ii+1 ← {Split(I)|I ∈ Ii } , i ← i + 1 i S 3. I ← Ij j=1
Notice that the number of elements in I is bounded by 2m. • For a set of numbers I ⊂ [m] we will define the function hI (x) =
P
hi (x)
i∈I
We will note that the total number of iterations in the construction of I is bounded by 2 log m. In addition, for every iteration in the construction |Ij | ≤ 2j−1 . We now describe the recursive `1 sampling algorithm. The following lemmas state the correctness and running time of the Recursive `1 -Sampling. The first addresses the important case in which the functions hi are explicitly given linear functions. The second lemma states the correctness and running time for a more general setting, although the technique may be adapted for other settings as well. 26
Technion - Computer Science Department - M.Sc. Thesis MSC-2016-09 - 2016
Algorithm 12 Recursive `1 Sampling 1: Preprocessed: hI for all I ∈ I. 2: Input: xt 3: Initialize: I = [m] 4: while |I| > 1 do 5: I1 , I2 ← Split(I) 6: αI1 ← hI1 (xt ), αI2 ← hI2 (xt ), αI ← αI1 + αI2 7: 8: 9:
Randomly update I in the following way: I ←
( I1
w.p.
I2
w.p.
end while Return vt = ei h[m] (xt ) = ei |h(xt )|1 , for i such that I = {i}.
α I1 αI α I2 αI
Lemma 3.1.1. (Correctness and running time in the linear case) Suppose hi are explicit linear functions. Then, (Running time) The preprocessing for the recursive `1 -sampling can be carried out in O(nm) operations, and given the preprocessing, every time the recursive `1 -sampling is carried out, it runs O (log m) iterations and takes O(n log m) operations. (Correctness) Let u = recursive `1 − sampling(h(x) according to algorithm 12. Then the distribution of u is equal to the distribution hi (x) . obtained by `1 -sampling of h(x), that is, ∀i ∈ [m] : u = ei |h(x)|1 w.p. |h(x)| 1 Lemma 3.1.2. (Correctness and running time in a more general setting) Let G be a set of convex (concave) functions over K ⊂ Rn which is parametrised by a vector a ∈ A ⊂ Rd , that is: G = {ga (x) : K → R}a∈A . Assume that A is closed under addition ( ∀a1 , a2 ∈ A : a1 + a2 ∈ A) and G is linear in its parameter: ∀a1 , a2 ∈ A, x ∈ K : ga1 (x) + ga2 (x) = ga1 +a2 (x). Suppose that for any a ∈ A and any x ∈ K the evaluation of ga (x) can be done in at most TG operations. Finally assume that for all i ∈ [m] it holds that hi (x) = gai (x) for some given ai ∈ A, and for all i ∈ [m] these are non-negative functions. In this case we claim the following: (Running time) The preprocessing for the recursive `1 -sampling can be carried out in O(dm) operations, and given the preprocessing, every time the recursive `1 -sampling is carried out, it runs O (log m) iterations and takes O(TG log m) operations. (Correctness) Let u = recursive `1 − sampling(h(x) according to algorithm 12. Then the distribution of u is equal to the distribution hi (x) . obtained by `1 -sampling of h(x), that is, ∀i ∈ [m] : u = ei |h(x)|1 w.p. |h(x)| 1 We will present the proof of the correctness which is the same for both lemmas and an outline of the proof for the running time. Proof. (Correctness) Let i ∈ [m]. Notice that the probability that vt = ei |h(xt )|1 equals to the probability that at the end of the algorithm I = {i}. Consider the decision tree of all the possible choices done by the algorithm, and notice that there exists exactly one leaf that is {i} and so there exists one path from the root to this leaf. Denote the different values of I in this path: I1 = [m], I2 , ..., Ik = {i}. Since there is only one such path, the probability that at the end of the 27
Technion - Computer Science Department - M.Sc. Thesis MSC-2016-09 - 2016
algorithm I = {i}, is the probability that the algorithm has chosen this path which equals to k−1 ^ The algorithm chose to update I to be Ii+1 at the i-th = Pr i=1 iteration of the algorithm The algorithm chose to up- k−1 Y Pr date I to be Ii+1 at the i-th Ii = i=1 iteration of the algorithm k−1 Y hIi+1 (xt ) hi (xt ) hI (xt ) hi (xt ) = = k = hIi (xt ) hI1 (xt ) h[m] (xt ) |h(xt )|1 i=1
(Running Time) For a given xt , algorithm 12 runs O(log m) iterations. Using the preprocessing each iteration takes O(n) operations in the linear case and TG operations in the more general case. This gives gives a total running time of O(n log m) for the linear case and O(TG log m) in the more general case.
Recursive ℓ1 sampling for time-varying functions: updating the recursive ℓ1-sampling data structure

Notice that in the case of generalised covering, Algorithm 11 performs ℓ1-sampling on the set of functions f|⁰_{E_t}, which changes as E_t changes. This implies that we should change the data structure created by the preprocessing when the set of functions changes. Updating the data structure to correspond to f|⁰_{E_{t+1}}, based on the current state of the data structure which corresponds to f|⁰_{E_t}, is done using the following algorithm.

Algorithm 13 Recursive ℓ1 Sampling Data Structure Update
1: Input: (f|⁰_{E_t})_I for all I ∈ I, ∆E = E_t \ E_{t+1}
2: for j ∈ ∆E do
3:   for I ∈ I s.t. j ∈ I do
4:     (f|⁰_{E_{t+1}})_I ← (f|⁰_{E_t})_I − f_j
5:   end for
6: end for

The following lemma states the correctness and running time of the recursive ℓ1-sampling data structure update in the sped-up framework. The running time in this case relies on the fact that ∀ 0 ≤ t < T : |E_t \ E_{t+1}| ≤ 1. The lemma refers only to the case in which the f_i are explicit linear functions. Similar lemmas can be proven for some other settings as well.

Lemma 3.1.3. (Recursive ℓ1-sampling data structure update, correctness and running time) Suppose the f_i are explicit linear functions. In the fast framework for covering (Algorithm 11), if for all 0 ≤ t the data structure is updated with ∆E = E_t \ E_{t+1}, then after this update the data structure contains the data for the functions f|⁰_{E_{t+1}}. In this case, every time the update is carried out it takes O(n log m) operations.
3.2
Amortized Complexity
When using the ℓ1-sampling, the actual gradients we are using, v_t, contain only one non-zero entry (which is an index of a "live" expert in the case of generalised covering). The main advantage of this is that the changes between iterations are small, and by remembering variables from previous iterations the runtime complexity of an iteration can be reduced. In particular, Algorithm 11 can be implemented such that for each iteration:
a. Calculating the function p_{t+1}^T f takes O(n) operations.
b. The recursive ℓ1-sampling data structure update takes O(n log m) operations.
c. An iteration of the MWU algorithm takes O(1) operations.
d. Checking whether the while-loop condition is satisfied can be carried out using O(1) operations.
From the above we get the following:

Lemma 3.2.1. Suppose the f_i are explicit linear functions. Algorithm 11 can be implemented such that each iteration takes O(n log m) operations plus one maximisation of a linear function over the domain K.

For the detailed description of the efficient implementation the reader is referred to the appendix. We mention, however, that the implementation does not require much sophistication; it mainly requires looking carefully into the details of the algorithm and carrying it out wisely.
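Item (a) above is the main piece of bookkeeping: because v_t has a single non-zero coordinate, the unnormalized weight vector changes in one entry per iteration (plus at most one elimination), so the row combination w_t^T F can be maintained incrementally. A sketch of this bookkeeping for linear f(x) = Fx, under my own naming:

import numpy as np

class IncrementalWeightedRows:
    """Maintain s_t = sum_i w_t(i) F_i over the live set in O(n) per update."""

    def __init__(self, F):
        self.F = F                              # m x n non-negative matrix
        self.w = np.ones(F.shape[0])            # unnormalized MWU weights
        self.live = np.ones(F.shape[0], dtype=bool)
        self.s = F.sum(axis=0)                  # s = w^T F restricted to live rows
        self.wsum = float(self.w.sum())

    def update(self, i, factor):
        """Apply w(i) <- w(i) * factor for the single sampled coordinate i."""
        delta = (factor - 1.0) * self.w[i]
        self.w[i] += delta
        if self.live[i]:
            self.s += delta * self.F[i]         # O(n)
            self.wsum += delta

    def eliminate(self, i):
        """Remove row i from the live set (at most one per iteration)."""
        if self.live[i]:
            self.live[i] = False
            self.s -= self.w[i] * self.F[i]     # O(n)
            self.wsum -= self.w[i]

    def p_dot_F(self):
        """Return p^T F, where p is the normalized restriction of w to live rows."""
        return self.s / self.wsum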
3.3
Analysis of the Sped-up Framework
We now develop the proof of Theorem 3.0.3, the theorem regarding the running time and correctness of the generalised covering framework. A discussion of the proof of Theorem 3.0.2 for the generalised packing case, which is similar in many ways, can be found in the appendix.

Proof. (Theorem 3.0.3)
Note: The proofs of all the lemmas stated in this proof can be found in the appendix.
Denote by T the value of t at the end of the algorithm. First we notice that the number of iterations T is bounded.

Lemma 3.3.1. T ≤ 2Km = O(m log m / ε²).

The following lemmas give a bound on the distortion created by the sampling, in a multiplicative sense. Their proofs rely on a variation of a multiplicative Azuma-like inequality stated in 6.0.17 in the appendix (based entirely on Lemma 10 from [15]).

Lemma 3.3.2. With probability at least 1 − e^{−3}:

    (1 − ε) Σ_{t=1}^T p_t^T ω_t (f(x_t)|⁰_{E_t}) ≤ Σ_{t=1}^T p_t^T v_t + 3/ε

Lemma 3.3.3. With probability at least 1 − e^{−3}:

    (1 − ε) min_{p∈∆_m} Σ_{t=1}^T p^T v_t ≤ min_{p∈∆_m} Σ_{t=1}^T p^T ω_t (f(x_t)|⁰_{E_t}) + (3 log m)/ε

Notice that the probability that the events in both lemmas hold is larger than 3/4. From the regret bound of the MWU algorithm (Theorem 2.1.1) we get:
Lemma 3.3.4. (1 − η) Σ_{t=1}^T p_t^T v_t ≤ min_{p∈∆_m} Σ_{t=1}^T p^T v_t + (log m)/η.
Rearranging 3.3.3 we get:

    ( min_{p∈∆_m} Σ_{t=1}^T p^T ω_t (f(x_t)|⁰_{E_t}) ) / ( min_{p∈∆_m} Σ_{t=1}^T p^T v_t )
        ≥ 1 − ε − (3 log m) / (ε · min_{p∈∆_m} Σ_{t=1}^T p^T v_t)                          (3.1)

Using 3.3.2 and 3.3.4 we get:

    ( min_{p∈∆_m} Σ_{t=1}^T p^T v_t ) / ( Σ_{t=1}^T p_t^T ω_t (f(x_t)|⁰_{E_t}) )
        ≥ 1 − 2ε − ( (log m)/η + 3/ε ) / ( (1 − ε) · min_{p∈∆_m} Σ_{t=1}^T p^T v_t )        (3.2)

Now,

    min_{p∈∆_m} p^T f(x̄) / λ*
      ≥ ( min_{p∈∆_m} Σ_{t=1}^T p^T ω_t f(x_t) ) / ( λ* Σ_{t=1}^T ω_t )                     [defn of x̄ and entry-wise concavity of f]
      ≥ ( (1 − ε) min_{p∈∆_m} Σ_{t=1}^T p^T ω_t f(x_t) ) / ( Σ_{t=1}^T p_t^T ω_t f(x_t) )    [choice of x_t]
      ≥ ( (1 − ε) min_{p∈∆_m} Σ_{t=1}^T p^T ω_t (f(x_t)|⁰_{E_t}) ) / ( Σ_{t=1}^T p_t^T ω_t (f(x_t)|⁰_{E_t}) )    [positivity of f and ∀t ∀i ∉ E_t : p_t(i) = 0]
      ≥ 1 − 4ε − (9 log m) / (ε · min_{p∈∆_m} Σ_{t=1}^T p^T v_t)                            [multiplying 3.1 and 3.2]
      ≥ 1 − 4ε − (9 log m) / (ε K)                                                          [stopping condition of Algorithm 11]
      ≥ 1 − 5ε                                                                              [value of K]

All that is left to show is the running time of the sped-up framework in the case the f_i are linear functions. Since we already bounded the number of iterations, using Lemma 3.2.1 we get the desired result.
Chapter 4
Applications

4.1 Normalized Covering SDP
In this section we will show how to approximate the Normalized Covering SDP problem (1.12) in near-linear time. Suppose we are given a Normalized Covering SDP defined by A_i ∈ S^n_+ for i = 1, ..., m, and ε ∈ (0, 1/2). Claim 1.1.1 implies that finding an approximate solution to the corresponding Maxmin Positive SDP problem (1.13) is sufficient for finding an approximate solution to the original problem (using linear-runtime operations for the reduction itself). We will apply the sped-up framework for generalised covering (Algorithm 11) to the corresponding Maxmin Positive SDP problem. This problem is a special case of the generalized covering problem in which K = {X ∈ S^n_+ | Tr(X) ≤ 1} and ∀i ∈ [m] : f_i(X) := A_i • X, where A_i ∈ S^n_+ for all i. Notice that the f_i are linear functions.
The optimization step of our algorithm (line 6 in Algorithm 11) for the case of Maxmin Positive SDP amounts to solving a maximum eigenvector problem. The following lemma summarises the running time for this problem, which can be achieved using the Lanczos method (for the proof see the appendix).

Lemma 4.1.1. Let A ∈ S^n_+ be a matrix with N non-zero entries. Denote K = {X ∈ S^n_+ | Tr(X) ≤ 1}. Then there is an algorithm that, with high probability, returns in total time Õ(N/√ε) a matrix X ∈ K such that A • X ≥ (1 − ε) max_{X'∈K} A • X'.

We can now apply the sped-up framework for this case.

Corollary 4.1.1. Let K = {X ∈ S^n_+ | Tr(X) ≤ 1}. Then Algorithm 11 with input {A_i}_{i=1}^m, K, ε, which uses the algorithm from Lemma 4.1.1 as a multiplicative approximation oracle, returns X̄ ∈ K such that with probability at least 3/4:

    min_{i∈[m]} A_i • X̄ ≥ (1 − O(ε)) max_{X'∈K} min_{i∈[m]} A_i • X'

The algorithm can be implemented such that its running time in this case is Õ(mn² / ε^{2.5}).
The matrix (1 / min_{i∈[m]} A_i • X̄) · X̄ is a (1 + O(ε))-multiplicative approximation to the corresponding Normalized Covering SDP problem.
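The oracle of Lemma 4.1.1 only needs an approximate leading eigenvector of the weighted matrix Σ_i p_i A_i: if v is (close to) the top eigenvector, then X = vv^T lies in K and maximizes A • X. A simple sketch using SciPy's sparse Lanczos routine (scipy.sparse.linalg.eigsh is used here as a stand-in for the tuned Lanczos analysis of the lemma, so the Õ(N/√ε) guarantee is only indicative):

import numpy as np
from scipy.sparse.linalg import eigsh

def approx_max_eigvec_oracle(A, tol=1e-6):
    """Return X = v v^T with Tr(X) = 1 approximately maximizing A . X over
    K = {X PSD, Tr(X) <= 1}; A is a symmetric PSD (sparse or dense) matrix."""
    vals, vecs = eigsh(A, k=1, which='LA', tol=tol)   # largest eigenpair via Lanczos
    v = vecs[:, 0]
    return np.outer(v, v)   # rank-one X in K; A . X = v^T A v ~ lambda_max(A)

In Algorithm 11 this oracle would be called with A = Σ_i p_{t+1}(i) A_i, which for sparse A_i can be kept implicit (e.g. as a LinearOperator) so that the per-iteration cost stays near-linear in the number of non-zero entries.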
4.2
Non-negative Linear Classifier
We will apply the sped-up framework for generalised covering (Algorithm 11) to the Non-negative Linear Classifier (problem 1.15). This problem is a special case of the generalized covering problem in which K = B_n and f(x) = Ax, where A ∈ R^{m×n}_+, and so again the f_i are linear functions.
The optimization step of our algorithm (line 6 in Algorithm 11) for the case of the Non-negative Linear Classifier amounts to the problem of normalising a vector in R^n, which takes O(n) operations. Using this and Theorem 3.0.3 we get the following result.

Corollary 4.2.1. Let A ∈ R^{m×n}_+ and let ε < 1/2. Algorithm 11 with input A, B_n, ε returns x̄ ∈ K such that with probability at least 3/4:

    min_{i∈[m]} (A x̄)_i ≥ (1 − O(ε)) max_{x∈B_n} min_{i∈[m]} (A x)_i

The algorithm can be implemented such that its running time in this case is Õ(mn / ε²).
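The maximization oracle here is explicit: for a fixed distribution p over the rows, max_{x∈B_n} p^T A x is attained at x = A^T p / ‖A^T p‖_2. A short sketch (my own helper, written for a dense numpy matrix A):

import numpy as np

def classifier_max_oracle(A, p):
    """Exact maximizer of p^T A x over the Euclidean unit ball B_n."""
    g = A.T @ p                          # gradient of the linear objective
    norm = np.linalg.norm(g)
    return g / norm if norm > 0 else g   # x = A^T p / ||A^T p||_2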
Chapter 5
Lower Bounds

In this chapter we give some lower bounds on the running times of multiplicative approximation algorithms for several problems, all of which are special cases of the generalised packing or generalised covering problems with linear functions. Note that the lower bounds are only in terms of the input size, i.e. m and n, and not in terms of the approximation parameter ε. The bounds we give are based on the following conjecture:

Conjecture 5.0.1. Consider a matrix of dimensions n × d such that with probability 1/2 it is a NO instance - each row of the matrix is a random unit vector (i.e. all zeros except for one entry whose value is 1), chosen uniformly and independently across rows - and with the remaining 1/2 probability it is a YES instance - one row is chosen at random to be zero, and the remaining rows are randomly chosen unit vectors (in a similar manner to the first case). Then, any algorithm that with probability at least 2/3 determines whether the matrix is a YES instance or a NO instance (i.e. whether or not it contains a row of zeros) must read Ω(nd) entries of the matrix.
5.1
Lower Bounds for Zero Sum Games
Definition 5.1.1. We will call an algorithm A a multiplicative approximation algorithm for the decision problem of covering ZSG if for all inputs of the form A ∈ R^{n×d}_+, ε, λ the algorithm returns, with probability at least 2/3, x ∈ ∆_d such that Ax ≥ λ(1 − ε) (entry-wise) if such exists, and declares failure if there is no such x ∈ ∆_d.

Definition 5.1.2. We will call an algorithm A a multiplicative approximation algorithm for the decision problem of packing ZSG if for all inputs of the form A ∈ R^{n×d}_+, ε, λ the algorithm returns, with probability at least 2/3, x ∈ ∆_d such that Ax ≤ λ(1 + ε) (entry-wise) if such exists, and declares failure if there is no such x ∈ ∆_d.

Lemma 5.1.1. Any multiplicative approximation algorithm for the decision problem of covering ZSG has running time Ω(nd).
Proof. Let A be such an algorithm. Let us create an algorithm for distinguishing the YES/NO instances introduced in 5.0.1: given a matrix A, run A on A, ε = 1/2, λ = 1/d. If A declared failure, return YES (the instance contains a row of zeros); otherwise return NO. Notice that if the matrix is a NO instance then for x = (1/d)·1_d it holds that Ax = (1/d)·1_n ≥ λ(1 − ε), but if the matrix is a YES instance then Ax will always contain an entry which equals zero and is therefore smaller than λ(1 − ε), so no x can fulfil the desired inequalities. We get that the algorithm we introduced distinguishes between YES and NO instances, and so according to 5.0.1 its running time is Ω(nd).

Lemma 5.1.2. Any multiplicative approximation algorithm for the decision problem of packing ZSG has running time Ω(nd).

Proof. Let A be such an algorithm. Let us create an algorithm for distinguishing the YES/NO instances introduced in 5.0.1: given a matrix A, define B = A^T and run A on B, ε = 1/2, λ = 0. If A declared failure, return NO (the instance A contains no row of zeros); otherwise return YES. Notice that if A is a NO instance then for any x ∈ ∆_n at least one entry of Bx is not smaller than 1/n, and so x could not fulfil all the desired inequalities. On the other hand, if A is a YES instance, with the j-th row all zeros, then for x = e_j it holds that Bx = 0_d ≤ λ(1 + ε). We get that the algorithm we introduced distinguishes between YES and NO instances, and so according to 5.0.1 its running time is Ω(nd).
5.2
Lower Bounds for Non-Negative Linear Classifier
Definition 5.2.1. We will call an algorithm A a multiplicative approximation algorithm for the decision problem of Non-negative Linear Classifier if for all inputs of the form A ∈ R^{n×d}_+, ε, λ the algorithm returns, with probability at least 2/3, x ∈ B_d such that Ax ≥ λ(1 − ε) (entry-wise) if such exists, and declares failure if there is no such x ∈ B_d.

Lemma 5.2.1. Any multiplicative approximation algorithm for the decision problem of Non-negative Linear Classifier has running time Ω(nd).

Proof. Let A be such an algorithm. Let us create an algorithm for distinguishing the YES/NO instances introduced in 5.0.1: given a matrix A, run A on A, ε = 1/2, λ = 1/√d. If A declared failure, return YES (the instance contains a row of zeros); otherwise return NO. Notice that if the matrix is a NO instance then for x = (1/√d)·1_d it holds that Ax = (1/√d)·1_n ≥ λ(1 − ε), but if the matrix is a YES instance then Ax will always contain an entry which equals zero and is therefore smaller than λ(1 − ε), so no x can fulfil the desired inequalities. We get that the algorithm we introduced distinguishes between YES and NO instances, and so according to 5.0.1 its running time is Ω(nd).
Chapter 6
Appendix

6.0.1 Regret of the MWU Algorithm
In this part we prove Lemma 2.1.1, showing that despite the modification, the MWU algorithm (Algorithm 4) can still have the same regret bound guarantee as the original MW algorithm in the case relevant to our framework, in which the costs are always non-negative. We will assume from here on that:
• {v_t}_{t=1}^T is a series of non-negative cost vectors for which ∀t ∀i ∈ E_t : v_t(i) ≤ 1.
• E_{T+1} ≠ ∅

The next lemma says, intuitively, that two series of cost vectors which always agree on the costs of "live" experts will produce the same series of outputs.

Lemma 6.0.2. Let {v'_t}_{t=1}^T be another series of non-negative cost vectors, and denote by {p'_t, w'_t, E'_t}_{t=1}^T the corresponding variables of MW-K when running on this series. If ∀t ∀i ∈ E_t : v'_t(i) = v_t(i), then ∀t : E'_t = E_t, ∀i ∈ E_t : w'_t(i) = w_t(i), and p'_t = p_t.

Proof. First let us show that ∀t : E'_t = E_t. Let 1 ≤ t ≤ T + 1. If i ∈ E_t then from the monotonicity of E_s, for all 1 ≤ s < t it holds that i ∈ E_s, and so Σ_{1≤s<t} v'_s(i) = Σ_{1≤s<t} v_s(i) < K, and so i ∈ E'_t.