
Fractional Norm Regularization: Learning With Very Few Relevant Features

Ata Kabán

Abstract— Learning in the presence of a large number of irrelevant features is an important problem in high-dimensional tasks. Previous studies have shown that L1-norm regularization can be effective in such cases while L2-norm regularization is not. Furthermore, work in compressed sensing suggests that regularization by nonconvex (e.g., fractional) semi-norms may outperform L1-regularization. However, for classification it is largely unclear when this may or may not be the case. In addition, the nonconvex problem is harder to solve than the convex L1 problem. In this paper, we provide a more in-depth analysis to elucidate the potential advantages and pitfalls of nonconvex regularization in the context of logistic regression where the regularization term employs the family of Lq semi-norms. First, using results from the phenomenon of concentration of norms and distances in high dimensions, we gain intuition about the working of sparse estimation when the dimensionality is very high. Second, using the probably approximately correct (PAC)-Bayes methodology, we give a data-dependent bound on the generalization error of Lq-regularized logistic regression, which is applicable to any algorithm that implements this model, and may be used to predict its generalization behavior from the training set alone. Third, we demonstrate the usefulness of our approach by experiments and applications, where the PAC-Bayes bound is used to guide the choice of semi-norm in the regularization term. The results support the conclusion that the optimal choice of regularization depends on the relative fraction of relevant versus irrelevant features, and a fractional norm with a small exponent is most suitable when the fraction of relevant features is very small.

Index Terms— Distance concentration, fractional norm, high dimensionality, PAC-Bayes analysis, sparse classification.

I. INTRODUCTION

WE ARE interested in high-dimensional problems with few relevant features and small sample sizes. This is the typical setting, e.g., in genomic and proteomic array classification [24]. L1-regularization has become a workhorse in statistical machine learning because of its sparsity-inducing property and its convenient convexity. Detailed theoretical and empirical analysis [28] has also shown its ability to learn with exponentially many irrelevant features, in the context of L1-regularized logistic regression. However, independent results in compressed sensing and other areas seem to indicate that nonconvex regularizers may have an added value despite the presence of local optima in the optimization objective. Recent work in statistics [9],

[12], [37], signal reconstruction [35], and compressed sensing [6]–[8], [30], [40] has considered nonconvex penalties and established the oracle properties of nonconvex regularizers—i.e., with a proper choice of the regularization parameters, such nonconvex penalty-based estimators can recover sparse vectors as accurately as if the identity of the relevant features were known. Furthermore, good empirical results were also reported using nonconvex penalties in support vector machine (SVM) classification [5], [34], and greedy methods to optimize the L0-norm directly were also proposed [10], [17], [33]. However, the success of these methods for learning and classification appears to be data dependent in experimental studies [25], [34]. It has been noted that sparsity by itself does not guarantee good generalization [10], [34], and excessive sparsity can be detrimental to the generalization performance [32]. It remains largely unclear under what conditions nonconvex regularization is preferable to the more classical L1 regularization for classification problems. Furthermore, since the nonconvex problem is harder to solve than the convex L1 problem, answering this question is difficult.

In this paper, we pursue a more in-depth analysis and elucidate the potential advantages and pitfalls of sparsity-inducing nonconvex regularizers for learning problems. We will focus on logistic regression for concreteness, and we adopt the family of nonconvex Lq semi-norms with q < 1 (sometimes referred to as fractional norms) as the representatives of nonconvex penalties in our formulation. The family of Lq semi-norms represents one of several alternative surrogates of the discrete L0-norm,1 along with SCAD, thresholding rules, the log-penalty, and the lasso [9], [12], and may be used for variable selection. The same Lq family was also proposed and studied as a replacement for the Euclidean distance in the data-mining literature concerned with the phenomenon of distance concentration in high dimensions [11]. Adopting the Lq-norm regularizer will allow us to gain insights from putting together the lessons learnt in these two different contexts. Furthermore, the Lq norm has a useful Bayesian interpretation, which allows us to employ the probably approximately correct (PAC)-Bayesian framework to develop a data-dependent and algorithm-dependent generalization bound. This bound can be used to guide the choice of algorithm and the choice of q in the exponent of the Lq regularization term.

1 The zero-norm of a vector w is defined as the number of its nonzero elements, i.e., ||w||_0 = card{w_i | w_i ≠ 0}. Minimization of the zero-norm is conceptually the "ideal" formulation of feature selection, but unfortunately it is a hard combinatorial problem [2].



Since the use of nonconvex penalties requires the solution of a difficult nonconvex optimization problem, there is a surge of new algorithms being developed and proposed to solve such problems [12]. Our PAC-Bayes bound is applicable to any algorithm that implements the same model, and may be used to predict its generalization behavior from the training set alone.

A. Regularized Logistic Regression

Consider a training set of input–target pairs S = {(x_j, y_j)}_{j=1}^n drawn i.i.d. from some unknown data distribution D, where x_j ∈ R^m are m-dimensional input points and y_j ∈ {−1, 1} are their labels. For concreteness, we focus on regularized logistic regression

\max_{w} \sum_{j=1}^{n} \log p(y_j \mid x_j, w) \quad \text{s.t.} \quad \|w\|_q \le A \qquad (1)

where p(y | w^T x) = 1/(1 + exp(−y w^T x)) is parameterized by w ∈ R^{1×m}, m is the data dimensionality, and n is the sample size. The norm that forms the regularization term is defined as

\|w\|_q = \Bigg( \sum_{i=1}^{m} |w_i|^q \Bigg)^{1/q}. \qquad (2)

Note that if q = 2 or q = 1, this is L2- or L1-regularized regression, respectively. The generalization ability and sample complexity of these two models have been comprehensively studied in [28], showing the impressive superiority of L1 on problems with exponentially many irrelevant features. In turn, if q ∈ (0, 1), we have a nonconvex regularization term, which we will refer to as L_{q<1}-regularized logistic regression.

Theorem 1: Let w^1, …, w^N be i.i.d. m-dimensional random vectors whose entries satisfy the modeling assumptions (3) and (4). Then, for any ε > 0,

\lim_{m \to \infty} P\Big[ \max_{1 \le j \le N} \|w^j\|_q \le (1+\varepsilon) \min_{1 \le j \le N} \|w^j\|_q \Big] = 1

where the probability is over the draw of the random sample of size N.

Proof: From the result in [4], a sufficient condition for the stated limit is to have, for some p > 0, that lim_{m→∞} Var[(||w||_q)^p] / E[(||w||_q)^p]^2 = 0. Denoting the term under this limit by RV_m^{(p)}, and choosing p = q, we have

RV_m^{(q)} = \frac{\sum_{i=1}^{n}\sum_{j=1}^{n}\mathrm{Cov}[|w_i|^q,|w_j|^q] + \sum_{i=n+1}^{m}\mathrm{Var}[|w_i|^q]}{\sum_{i=1}^{m}\sum_{j=1}^{m} E[|w_i|^q]\,E[|w_j|^q]}

where we used the independence of w_{n+1}, …, w_m and the independence of w_i, i ∈ {n+1, …, m} from w_j, j ∈ {1, …, n}. Now, this sequence (RV_m^{(q)})_{m=1,…} converges to 0 as m → ∞ since n is finite.

Hence, in increasingly high dimensions, the regularization term can no longer distinguish between the possible optimizers of the unregularized likelihood. In fact, simulations in [4] indicate that the problem may become of concern already at 10–20 dimensions. Fig. 1 illustrates the vanishing contrast between the norms of a sequence of increasingly high dimensional vectors with entries drawn from a uniform distribution.

Fortunately, not all norms concentrate equally fast as the dimensionality increases. In particular, the family of norms used in our regularization term was studied in this respect in [1], [11], which will give us further insight into the effect of q. For this part of the analysis, we chose p = 1.

Theorem 2 (extension of François et al. [11]): If w ∈ R^m has no more than n < m non-i.i.d. components, where n is finite, then under the modeling assumptions (3) and (4) we have

\lim_{m \to \infty} m\,\frac{\mathrm{Var}[\|w\|_q]}{E[\|w\|_q]^2} = \frac{1}{q^2}\,\frac{\sigma^2}{\mu^2} \qquad (5)

where μ = E[|w_{n+1}|^q], σ² = Var[|w_{n+1}|^q], and the index n + 1 refers to one of the i.i.d. dimensions of w.


[Fig. 1 here: sample estimate of Var[||w||_q] / E[||w||_q]^2 versus the number of dimensions m (0 to 50), for w ~ Unif[0,1], with curves for L2-norms, L1-norms, L0.5-'norms', and L0.1-'norms'.]

Fig. 1. Relative variance of Lq-norms as the dimensionality increases. The limit is zero eventually, but fractional semi-norms with low exponent converge slowest.
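The concentration effect behind Fig. 1 is easy to reproduce numerically. The following is an illustrative sketch (the sample size and the grids of q and m are arbitrary choices made here, not taken from the paper):

```python
import numpy as np

def relative_variance(q, m, n_samples=10000, seed=0):
    """Monte Carlo estimate of Var[||w||_q] / E[||w||_q]^2 for w with
    i.i.d. Unif[0,1] entries in m dimensions (||.||_q as in (2))."""
    rng = np.random.default_rng(seed)
    w = rng.uniform(0.0, 1.0, size=(n_samples, m))
    norms = np.sum(w ** q, axis=1) ** (1.0 / q)
    return norms.var() / norms.mean() ** 2

if __name__ == "__main__":
    for q in (2.0, 1.0, 0.5, 0.1):
        estimates = [relative_variance(q, m) for m in (5, 10, 20, 50)]
        print(f"q={q}:", ["%.4f" % rv for rv in estimates])
```

Smaller exponents q give a visibly larger relative variance at any fixed m, matching the ordering of the curves in Fig. 1.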

Proof: To allow for a finite number of possibly non-i.i.d. marginal distributions, we write

\lim_{m\to\infty} \frac{1}{m}\sum_{i=1}^{m}|w_i|^q = \lim_{m\to\infty}\Bigg[\frac{1}{m}\sum_{i=1}^{n}|w_i|^q + \frac{m-n}{m}\cdot\frac{1}{m-n}\sum_{i=n+1}^{m}|w_i|^q\Bigg] = \lim_{m\to\infty}\frac{1}{m-n}\sum_{i=n+1}^{m}|w_i|^q

and

\lim_{m\to\infty} \frac{\mathrm{Var}[\|w\|_q^q]}{m} = \lim_{m\to\infty}\frac{1}{m}\Bigg[\sum_{i=1}^{n}\sum_{j=1}^{n}\mathrm{Cov}(|w_i|^q,|w_j|^q) + \sum_{i=n+1}^{m}\mathrm{Var}(|w_i|^q)\Bigg] = \mathrm{Var}[|w_{n+1}|^q].

Using these, the rest of the proof follows the same steps as the original theorem—this gives that lim_{m→∞} E[||w||_q]/m^{1/q} = μ^{1/q}, and lim_{m→∞} Var[||w||_q]/m^{2/q−1} = σ²/(q² μ^{2(q−1)/q}). Put together, we get (5).

Corollary 1: Under the model assumptions (3) and (4), in large-m, small-n settings, the L_{q=0}-norm regularizer maximizes the relative contrast Var[||w||_q]/E[||w||_q]².

Proof: For large m, we can approximate RV_m^{(1)} = Var[||w||_q]/E[||w||_q]² using (5) as Var[||w||_q]/E[||w||_q]² ≈ σ²/(m q² μ²) and read off the optimal q by computing the maximizer of this expression. Computing μ and σ² for w_{n+1} ∼ Unif[−a, a] yields

\mu = E[|w_{n+1}|^q] = \frac{1}{2a}\int_{-a}^{a}|w_{n+1}|^q\,dw_{n+1} = \frac{a^q}{q+1}

\sigma^2 = E[|w_{n+1}|^{2q}] - E[|w_{n+1}|^q]^2 = \frac{q^2 a^{2q}}{(2q+1)(q+1)^2}.

So we have

\frac{\mathrm{Var}[\|w\|_q]}{E[\|w\|_q]^2} \approx \frac{1}{m}\cdot\frac{1}{2q+1} \qquad (6)

which rather conveniently turns out to be independent of a. Most importantly, we see that (6) is a monotonically decreasing function of q. Note from the proofs that the ingredient that causes the concentration of norms is a growing number of i.i.d. entries in w that overtake a finite number of non-i.i.d. components. This is exactly the situation when a growing number of “irrelevant” (noise) features overtake a small number of “relevant” (structure-bearing) features. It follows that, in a scenario with an increasing number of irrelevant input dimensions, the smallest q is the one that can best slow down the concentration effect.

This suggests indeed that the ideal exponent would be q = 0 in such cases—that is, the optimal choice is the zero-norm—in agreement with what is commonly considered to be the norm that conveys the ideal sparse concept [6], [17], [34]. Note, however, that this conclusion is only valid under the assumptions under which it was derived, namely when the number of i.i.d. (noise) features grows without bounds while the number of relevant features, i.e., those features that contain structure, is fixed and finite. On one hand, this provides some novel insight into sparse learning; on the other, the scenario involved can only be expected to be a good model for real situations when very high-dimensional data contains only very few relevant features. Consequently, this analysis, while insightful, says nothing about cases that do not match this scenario, for instance, when we have a large number of partially relevant features. For such cases, further analysis is needed, which we do in the next section. We will see, in accordance with recent empirical findings [23], that in the case of a larger fraction of relevant features, the optimal q is not the smallest one.

In the next section we further analyze Lq-regularization in logistic regression using the PAC-Bayesian framework originally proposed by McAllester [26]. This framework is very general and allows us to derive a generalization guarantee for any choice of q in a data-dependent and algorithm-dependent manner, while it remains distribution-free (i.e., it holds for any unknown data distribution, assuming an i.i.d. sampling of the data).

III. PAC-BAYES ANALYSIS

The main result of this section is to upper-bound the generalization error of L_{q≤1}-regularized logistic regression by applying the PAC-Bayes methodology. Contrary to classical PAC bounds that provide worst-case guarantees about the learnability of a problem, PAC-Bayesian bounds are data-dependent and algorithm-dependent a posteriori PAC bounds—that is, the numeric value of the bound is specific to the implementation algorithm and the training set at hand. However, the analytic form of the bound is generic; in particular, the bound we will derive in this section can take any algorithm that implements the nonconvex problem of Lq-regularized logistic regression with any chosen q ∈ (0, 1). The implementation that we used in the reported experiments is based on local quadratic approximations and is detailed in Appendix A.

The first step will be to interpret our classification method in terms of Bayesian modeling assumptions, as a probabilistic model that includes prior knowledge (encoded by the regularization term in our case). The obtained bound is valid for any choice of prior knowledge and any unknown data distribution, but whenever the prior knowledge is a good match to the unknown data distribution, the bound will be tight, and vice versa. It should be remembered that PAC-Bayes bounds provide guarantees on the future performance of a given algorithm trained on a given training set on the basis of its expected training error only, and without the use of a hold-out set. Moreover, the provided guarantees are nontrivial


(i.e., they can fail with a probability of less than 1); hence, given the moderate sizes of training data, they can be tight enough for practical use.

In the PAC-Bayesian setting, a classifier is defined by a distribution Q(w) over the hypothesis space. Each sample from this distribution represents a classifier or hypothesis parameterized by w, h_w: R^m → {−1, 1}, where in our case h_w(x) = sign(w^T x). We will drop the index w whenever it does not cause confusion. We are interested in the expected generalization error Q_D ≡ E_{h∼Q} Pr_{(x,y)∼D}{h(x) ≠ y} in terms of its gap from the expected empirical error Q̂_S ≡ E_{h∼Q}[(1/n) ∑_{i=1}^n I(h(x_i) ≠ y_i)], where both expectations are taken w.r.t. Q(w)—this eliminates the probabilistic element. The distribution Q(w) is often called the “posterior” distribution. Although it does not need to be a Bayesian posterior, it is typically defined as a function of the training sample. The mentioned gap between the expected generalization error and the expected training error will then be expressed as a function of the gap between the “prior” and the “posterior” distributions over the space of classifiers.

Theorem 3 (Generic PAC-Bayes Theorem [21], [26], [31]): Define KL+[Q̂_S || Q_D] = Q̂_S ln(Q̂_S/Q_D) + (1 − Q̂_S) ln((1 − Q̂_S)/(1 − Q_D)) for Q_D ≥ Q̂_S (and 0 otherwise), the gap between the expected generalization error and the expected training error, and KL(Q||P) = E_{h∼Q} ln(Q(h)/P(h)), the gap [Kullback–Leibler (KL) divergence] between the posterior and the prior. For all i.i.d. distributions D over the data, ∀P(h) prior over the classifiers h: X → Y, and ∀δ ∈ (0, 1) risk, we have

\Pr_{S \sim D^n}\bigg[ \forall Q(h):\; KL_{+}[\hat{Q}_S \| Q_D] \le \frac{1}{n}\Big( KL(Q \| P) + \ln\frac{n+1}{\delta}\Big)\bigg] \ge 1 - \delta. \qquad (7)

In the remainder of this section, we will derive an instantiation of this framework for our class of Lq-regularized logistic regression classifiers. The key technical ingredient is to reinterpret the optimization problem (26) in a Bayesian sense, as an MAP estimate of logistic regression with a generalized Gaussian distribution (GGD) (known also as a generalized Laplacian) prior—noting that the regularization term is, up to a constant, the log of a product of independent zero-mean GGDs, P(w) = ∏_{j=1}^m GGD(w_j | 0, λ, q), where

\mathrm{GGD}(w_j \mid \mu_j, \lambda, q) = \frac{q\,\lambda^{1/q}}{2\,\Gamma(1/q)} \exp\big(-\lambda |w_j - \mu_j|^q\big).

We are interested in the case q ∈ (0, 1), so the log prior is nondifferentiable at zero, resulting in sparse estimates. So, unlike the work of [31], which develops a PAC-Bayes bound for Gaussian process classifiers, our model priors are of course non-Gaussian.

Accordingly, we will choose Q of the form Q(w) = ∏_{j=1}^m GGD(w_j | ŵ_j, λ̃, q̃), i.e., a product of GGDs centered on the parameter estimates ŵ (this is a function of the training sample through ŵ), and where λ̃, q̃ are parameters that will be chosen to tighten the bound.
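Before deriving the two model-specific quantities that enter (7)—the complexity term KL(Q||P) and the expected training error Q̂_S—it is worth noting how the generic bound is evaluated once these are available: the binary KL divergence KL+ is inverted numerically. The helper below is an illustrative sketch of this inversion (it is not code from the paper; the function names are placeholders):

```python
import math

def kl_bernoulli(q_hat, q):
    """Binary KL divergence KL+(q_hat || q) for q >= q_hat (0 otherwise)."""
    if q <= q_hat:
        return 0.0
    eps = 1e-12
    q_hat = min(max(q_hat, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return q_hat * math.log(q_hat / q) + (1 - q_hat) * math.log((1 - q_hat) / (1 - q))

def pac_bayes_bound(q_hat, kl_qp, n, delta=0.1):
    """Largest Q_D consistent with (7), found by bisection on the monotone KL+."""
    rhs = (kl_qp + math.log((n + 1) / delta)) / n
    lo, hi = q_hat, 1.0 - 1e-12
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if kl_bernoulli(q_hat, mid) > rhs:
            hi = mid
        else:
            lo = mid
    return hi

if __name__ == "__main__":
    print(pac_bayes_bound(q_hat=0.1, kl_qp=5.0, n=500, delta=0.1))
```

Because KL+(Q̂_S || q) is increasing in q for q ≥ Q̂_S, the bisection returns the largest generalization error Q_D that is still consistent with (7).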

Using this interpretation, we now proceed to derive the expressions for the complexity term KL[Q||P] and the expected training error Q̂_S for our model.

A. The KL[Q||P] Term

Since the priors on the parameter entries are independent, we have KL[Q||P] = ∑_{j=1}^m KL[Q(w_j)||P(w_j)], where

KL[Q(w_j) \| P(w_j)] \qquad (8)
= KL\big(\mathrm{GGD}(w_j \mid \hat{w}_j, \tilde{\lambda}, \tilde{q}) \,\|\, \mathrm{GGD}(w_j \mid 0, \lambda, q)\big) \qquad (9)
= E_{w_j \sim \mathrm{GGD}(w_j \mid \hat{w}_j, \tilde{\lambda}, \tilde{q})}\Bigg[\log \frac{\mathrm{GGD}(w_j \mid \hat{w}_j, \tilde{\lambda}, \tilde{q})}{\mathrm{GGD}(w_j \mid 0, \lambda, q)}\Bigg]. \qquad (10)

For q = 1, this integral is analytically tractable and was computed in [20]. For q < 1, unfortunately, it becomes intractable in general. To get around this, we will create an analytic upper bound, using the q-triangle inequality (valid for 0 < q ≤ 1)

|a + b|^q \le |a|^q + |b|^q \qquad (11)

to lower bound GGD(w_j | 0, λ, q) in the denominator:

\mathrm{GGD}(w_j \mid 0, \lambda, q) = \frac{q\,\lambda^{1/q}}{2\,\Gamma(1/q)} \exp(-\lambda |w_j|^q) \qquad (12)
\ge \frac{q\,\lambda^{1/q}}{2\,\Gamma(1/q)} \exp(-\lambda |w_j - \hat{w}_j|^q) \cdot \exp(-\lambda |\hat{w}_j|^q) \qquad (13)
= \mathrm{GGD}(w_j \mid \hat{w}_j, \lambda, q) \cdot \exp(-\lambda |\hat{w}_j|^q). \qquad (14)

Plugging this into (10), we get

KL[Q(w_j) \| P(w_j)] \le E_{w_j \sim \mathrm{GGD}(w_j \mid \hat{w}_j, \tilde{\lambda}, \tilde{q})}\Bigg[\log \frac{\mathrm{GGD}(w_j \mid \hat{w}_j, \tilde{\lambda}, \tilde{q})}{\mathrm{GGD}(w_j \mid \hat{w}_j, \lambda, q)\exp(-\lambda|\hat{w}_j|^q)}\Bigg]
= KL\big(\mathrm{GGD}(w_j \mid \hat{w}_j, \tilde{\lambda}, \tilde{q}) \,\|\, \mathrm{GGD}(w_j \mid \hat{w}_j, \lambda, q)\big) + \lambda|\hat{w}_j|^q.

Now, the remaining KL divergence is between two GGD densities that have the same mean parameter. This can be computed using elementary integration, and gives

KL\big(\mathrm{GGD}(w_j \mid \hat{w}_j, \tilde{\lambda}, \tilde{q}) \,\|\, \mathrm{GGD}(w_j \mid \hat{w}_j, \lambda, q)\big) = \log \frac{\tilde{q}\,\tilde{\lambda}^{1/\tilde{q}}\,\Gamma(1/q)}{q\,\lambda^{1/q}\,\Gamma(1/\tilde{q})} + \frac{\lambda\,\Gamma((q+1)/\tilde{q})}{\Gamma(1/\tilde{q})\,\tilde{\lambda}^{q/\tilde{q}}} - \frac{1}{\tilde{q}}. \qquad (15)

Put together, we have the needed upper bound on the complexity term

KL[Q(w_j) \| P(w_j)] \le \log \frac{\tilde{q}\,\tilde{\lambda}^{1/\tilde{q}}\,\Gamma(1/q)}{q\,\lambda^{1/q}\,\Gamma(1/\tilde{q})} + \frac{\lambda\,\Gamma((q+1)/\tilde{q})}{\Gamma(1/\tilde{q})\,\tilde{\lambda}^{q/\tilde{q}}} - \frac{1}{\tilde{q}} + \lambda|\hat{w}_j|^q. \qquad (16)

Observe that only the KL terms of the nonzero components of ŵ are affected by any nontightness of this inequality, since for components with ŵ_j = 0 the above inequality is satisfied with equality. It is quite instructive to look at the meaning of the expression (16). In the case of a low expected training error Q̂_S, this is the dominant term in the overall bound (7); in fact, Q_D is roughly proportional to KL(Q||P) in that case [31]. Since the bound (7) holds for any λ̃ and q̃, we can choose these to tighten the bound.
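As an illustration of how the complexity term is computed, the sketch below evaluates the sum over coordinates of the upper bound (16); it is a hypothetical helper (not the paper's code), with lam_t and q_t standing in for λ̃ and q̃:

```python
import numpy as np
from scipy.special import gammaln

def kl_qp_upper_bound(w_hat, lam, q, lam_t, q_t):
    """Sum over coordinates of the upper bound (16) on KL(Q||P)."""
    m = w_hat.size
    log_ratio = (np.log(q_t) + np.log(lam_t) / q_t + gammaln(1.0 / q)
                 - np.log(q) - np.log(lam) / q - gammaln(1.0 / q_t))
    middle = lam * np.exp(gammaln((q + 1.0) / q_t) - gammaln(1.0 / q_t)) / lam_t ** (q / q_t)
    per_coord_const = log_ratio + middle - 1.0 / q_t            # eq. (15)
    return m * per_coord_const + lam * np.sum(np.abs(w_hat) ** q)
```

Setting lam_t = lam and q_t = q makes the constant per-coordinate term vanish, recovering the simple bound λ||ŵ||_q^q given in (17) below.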


The minimum of (16) is achieved by setting λ̃ = λ and q̃ = q, which gives the following simple expression:

\min_{\tilde{\lambda}, \tilde{q}} KL[Q \| P] \le \lambda \|\hat{w}\|_q^q. \qquad (17)

So in the case of a low Q̂_S, the zero components of the estimate ŵ incur no cost—which is nice, considering that we are dealing with a sparse classifier. It is also interesting to observe that, for the special case of q = 1, this expression recovers the PAC-Bayes result on the L1-regularized classifier obtained in [20]. From (17), we see that the capacity of the classifier is a function of both the regularization parameter λ and the sparsity of the estimate ŵ in the sense of its Lq-norm. The bound is tight when the solution estimate is sparse, with many zeros and few large-magnitude components. It becomes loose if we set a large λ parameter (make a large bet on sparsity) but get a nonsparse estimate ŵ in return. Furthermore, as q gets smaller toward zero, the absolute magnitude of the estimates counts less; hence the bound (17) becomes less sensitive to the regularization parameter λ when the solution is indeed sparse, although more sensitive when the solution is not sparse, even if the magnitudes of the nonzero components are small.

Finally, we should point out that, beyond the case of a very low expected training error, the KL is just one of the terms, and the overall bound on the generalization error Q_D can be further tightened by managing the tradeoff between the KL term and the Q̂_S term, i.e., taking both into account when optimizing the parameters λ̃ and q̃. This optimization is carried out numerically from (7) once we have an estimate for the Q̂_S term, which we derive in the sequel.

B. The Q̂_S Term

The expected training error under the distribution Q(w)

\hat{Q}_S = \Pr_{h \sim Q,\,(x,y) \sim S}[h(x) \ne y] \qquad (18)
= E_{h \sim Q}\Bigg[\frac{1}{n}\sum_{i=1}^{n} I(h(x_i) \ne y_i)\Bigg] \qquad (19)
= \frac{1}{n}\sum_{i=1}^{n} \Pr_{h \sim Q}[h(x_i) \ne y_i] \qquad (20)

is analytically intractable to compute because of the non-Gaussian continuous distribution Q(w).

1) Monte Carlo Sampling Bound on Q̂_S: One way to approximate Q̂_S is to bound the tail of a Monte Carlo evaluation [21] as follows. Draw a set of classifiers {w^1, …, w^k} i.i.d. from Q(w), and denote the observed error rate of these k random hypotheses applied to a random training example by ε/k. Since the classifiers have been drawn independently, the number of observed errors is binomially distributed. Hence, the probability of the misleading event that we observe no more than ε errors out of k independent trials (when in fact the true error is larger) is given by the binomial tail

\Pr_{\#\mathrm{err}\sim\mathrm{Bin}(k,p)}(\#\mathrm{err} \le \varepsilon) = \sum_{j=0}^{\varepsilon}\binom{k}{j} p^j (1-p)^{k-j} \qquad (21)

where the binomial parameter p is the unknown true parameter. The RHS of (21) is a monotonically decreasing function of p and it is invertible. We set the RHS equal to δ/2, and there exists a unique p = p(δ/2) that satisfies this equation. Although the solution p(δ/2) cannot be expressed analytically, it can be found numerically (e.g., by a binary search). The obtained solution p(δ/2) represents the largest binomial parameter (error rate) under which we have confidence at least 1 − δ/2 that the observed error rate (#err/k) will not fall into a tail of size δ/2. That is, #err/k will be larger—except with a risk of δ/2—than the observed rate of ε/k. Hence, with probability at least 1 − δ/2, Q̂_S is upper-bounded by p(δ/2). Finally, plugging this bound into (7), we numerically solve this equation for Q_D, i.e., the generalization error, allowing a further risk of δ/2 in order to ensure that the overall bound holds w.p. 1 − δ.
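The Monte Carlo procedure just described can be sketched as follows (all settings are illustrative; in particular, GGD sampling is done here via a Gamma transformation, which is one standard way to draw from this family and is not prescribed by the paper):

```python
import numpy as np
from scipy.stats import binom
from scipy.optimize import brentq

def sample_ggd(w_hat, lam_t, q_t, size, rng):
    """Draw 'size' vectors from Q(w) = prod_j GGD(w_j | w_hat_j, lam_t, q_t):
    if G ~ Gamma(1/q, 1), then (G/lam)^(1/q) with a random sign has density
    proportional to exp(-lam |x|^q)."""
    g = rng.gamma(shape=1.0 / q_t, scale=1.0, size=(size, w_hat.size))
    signs = rng.choice([-1.0, 1.0], size=(size, w_hat.size))
    return w_hat + signs * (g / lam_t) ** (1.0 / q_t)

def mc_upper_bound_QS(X, y, w_hat, lam_t, q_t, k=10000, delta=0.1, seed=0):
    rng = np.random.default_rng(seed)
    W = sample_ggd(w_hat, lam_t, q_t, k, rng)          # k sampled classifiers
    idx = rng.integers(0, X.shape[0], size=k)          # one random training point each
    n_err = int(np.sum(y[idx] * np.sum(W * X[idx], axis=1) <= 0))
    if n_err == k:
        return 1.0
    # invert the binomial tail (21): largest p with Pr_{Bin(k,p)}(#err <= n_err) >= delta/2
    tail = lambda p: binom.cdf(n_err, k, p) - delta / 2.0
    return brentq(tail, n_err / k, 1.0 - 1e-12)
```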

2) Analytic Approximation of Q̂_S: A computationally cheaper alternative to the above Monte Carlo based approach is to develop an analytic Gaussian approximation to the term Q̂_S, motivated by the central limit theorem. For the case q = 1 (Laplace priors), a similar approach was taken in [20], which we extend to q ≤ 1. Making use of the form of the classification function, Q̂_S in (20) may be written as

\hat{Q}_S = \frac{1}{n}\sum_{i=1}^{n} \Pr_{w \sim \mathrm{GGD}(w \mid \hat{w}, \tilde{\lambda}, \tilde{q})}\big(y_i w^T x_i < 0\big). \qquad (22)

Each of the summands is given by the cumulative distribution function of y_i w^T x_i. Now, y_i w^T x_i = y_i ∑_{j=1}^m w_j x_{ij} is a sum of m independent random variables w_j x_{ij} (most of which are zero-mean when the estimate ŵ is sparse). By the central limit theorem for independent random variables, when m is large and ŵ is sparse, the distribution of the sum y_i ∑_{j=1}^m w_j x_{ij} will be close to Gaussian, so we can use the Gaussian cumulative distribution function (cdf) to approximate the terms in (22). The Gaussian approximation of the distribution of y_i w^T x_i is N(E[y_i w^T x_i], Var[y_i w^T x_i]), where it is easy to find that

E[y_i w^T x_i] = y_i \sum_{j=1}^{m} \hat{w}_j x_{ij} = y_i \hat{w}^T x_i \qquad (23)

\mathrm{Var}[y_i w^T x_i] = \frac{\Gamma(3/\tilde{q})}{\tilde{\lambda}^{2/\tilde{q}}\,\Gamma(1/\tilde{q})}\,\|x_i\|^2. \qquad (24)

Replacing these yields the following analytic approximation:

\hat{Q}_S \approx \frac{1}{n}\sum_{i=1}^{n} \Phi\Bigg(-\frac{y_i \hat{w}^T x_i}{\tilde{\lambda}^{-1/\tilde{q}}\sqrt{\Gamma(3/\tilde{q})/\Gamma(1/\tilde{q})}\;\|x_i\|_2}\Bigg) \qquad (25)

where Φ(·) is the standard Gaussian cdf. Empirically, we found this approximation to be fairly accurate, as we shall see in Section IV (Fig. 2).
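The analytic approximation (25) is straightforward to evaluate; the sketch below is a hypothetical helper (not the paper's code), with lam_t and q_t in the roles of λ̃ and q̃:

```python
import numpy as np
from scipy.special import gamma
from scipy.stats import norm

def QS_analytic(X, y, w_hat, lam_t, q_t):
    """CLT-based analytic approximation (25) of the expected training error."""
    margins = y * (X @ w_hat)                              # y_i * w_hat^T x_i, eq. (23)
    std = lam_t ** (-1.0 / q_t) * np.sqrt(gamma(3.0 / q_t) / gamma(1.0 / q_t)) \
          * np.linalg.norm(X, axis=1)                      # sqrt of eq. (24)
    return float(np.mean(norm.cdf(-margins / std)))        # eq. (25)
```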


[Fig. 2 here: three scatter plots of the Q̂_S analytic approximation versus the Q̂_S Monte Carlo upper bound, for the settings "1 relevant feature of 1000", "3 relevant features of 1000", and "exp decay relevance".]

Fig. 2. Agreement between the Monte Carlo sampling-based upper bound on Q̂_S and the analytic approximation given in (25).

IV. NUMERICAL SIMULATIONS

A. Results on Synthetic Data

1) Data Generation: In this section we validate our theoretical findings on synthetic data in the first instance. We generated a battery of datasets following [28]. The inputs were drawn from a multivariate standard normal distribution, and the labels were generated using a logistic model with true parameter values w set so as to obtain three types of datasets, as follows: 1) one relevant feature: w_1 = 10, w_i = 0, ∀i > 1; 2) three relevant features: w_1 = w_2 = w_3 = 10/3, w_i = 0, ∀i > 3; and 3) exponentially decaying relevance of features: w_i = (1/2)^{i−1} √75, ∀i ≥ 1. This way the scaling of the problem remains the same, so that the Bayes error remains the same. Further, for each type, the data dimension was varied from 10 to 1000. (A minimal code sketch of this data-generation procedure is included after the footnotes below.)

2) Validating the PAC-Bayes Bound: First, we convince ourselves that our analytic approximation for the term Q̂_S, as given in (25), is in agreement with the more computationally intensive Monte Carlo sampling-based upper bound. Fig. 2 shows the scatter plots of these two quantities from experiments conducted on 1000-D synthetic datasets with 100 points each. We see that the agreement is remarkable. Therefore, the subsequent tests will employ the analytic approximation.

Next, we assess the behavior of our PAC-Bayes bound against empirical performance measured on an independent test set in Fig. 3. We are interested to see whether the overall behavior (i.e., monotonicity, minima) of the predicted generalization error matches that of the true generalization error as a function of the fraction of relevant versus irrelevant features. In this set of experiments, we use 20-D datasets of 500 points each in order to allow the bounds to be quantitatively meaningful.2 As before, we have a setting with a single relevant feature (left) and one with exponentially decaying feature relevance (right). We evaluate and plot the PAC-Bayes bounds at δ = 0.1 for a grid of possible choices of the regularization parameters λ ∈ [0.01, 20] and q ∈ [0.1, 1], respectively (upper plots). That is, for each such pair of values (λ, q) considered, we numerically optimize the two parameters of the bound, i.e., λ̃ and q̃, to tighten the bound. This optimization was carried out by searching over a suitably large range: q̃ ∈ [0.1, 2], λ̃ ∈ [0.01, 200], to make sure we obtain a tight enough bound.3

We see from Fig. 3 that the parameter predictions are quite different for the two types of data. The lowest q tested is predicted to be best in the first case (along with a large λ), whereas in the second case a larger exponent, q = 0.7, is predicted to work best (along with a smaller λ). This agrees well with our analysis in Section II. For the best q predicted by these bounds, we then show on the lower plots the actual empirical error rate measured on an independent test set and the upper endpoint of its 90% confidence interval (also termed the test-set bound [21]) using 30% of the data held out, and we superimpose these with our PAC-Bayes bound for the various choices of λ. Note that both the empirical error rates and their test-set bound require access to an independent test set which was not used in training the classifier—on the contrary, the PAC-Bayes bound uses only the training sample (i.e., it is a train-set bound). Its key theoretical and practical advantage lies with its ability to model and

2 Recall that PAC-Bayes bounds try to predict the true generalization error from the training set alone. Hence we need a more sizeable training set for such bounds to be tight enough.

3 We used a simple grid search here, and should note that more sophisticated searches might possibly be able to further tighten our bounds and/or reduce computation time.
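The following is a minimal sketch of the synthetic data generation described in Section IV-A.1 above (the setting names are hypothetical labels introduced here):

```python
import numpy as np

def make_true_w(m, setting):
    w = np.zeros(m)
    if setting == "one_relevant":
        w[0] = 10.0
    elif setting == "three_relevant":
        w[:3] = 10.0 / 3.0
    elif setting == "exp_decay":
        w = (0.5 ** np.arange(m)) * np.sqrt(75.0)   # w_i = (1/2)^(i-1) * sqrt(75)
    return w

def make_dataset(m, n, setting, rng=None):
    rng = np.random.default_rng(rng)
    X = rng.standard_normal((n, m))                      # standard normal inputs
    p = 1.0 / (1.0 + np.exp(-X @ make_true_w(m, setting)))
    y = np.where(rng.random(n) < p, 1.0, -1.0)           # logistic labels in {-1, +1}
    return X, y
```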

[Fig. 3 here: upper plots show the PAC-Bayes bound (δ = 0.1) over the grid of (λ, q) for the settings "1 relevant f. of m=20; n=500" and "exp decay f.rel., m=20; n=500"; lower plots show, at the best predicted q (q = 0.2 and q = 0.7, respectively), the PAC-Bayes bound, the test-set bound, and the fraction of test errors as functions of λ.]

Fig. 3. Experiment with m = 20 features and n = 500 samples. The lower plots show a comparison of the PAC-Bayes bound at the slice of the best q predicted by the bound, against the fraction of empirical test error rates and their 90% confidence bound (i.e., the test set bound). For the latter two, 30% of the data was held out and used for estimating the test error.
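Putting the pieces together, the bound-guided selection of (λ, q) described in this section can be sketched as follows. This is an illustrative outline only: it reuses the hypothetical helpers sketched earlier in this document (make_dataset, fit_lq_logistic, kl_qp_upper_bound, QS_analytic, pac_bayes_bound), uses coarse grids, and is not the implementation used for the reported experiments.

```python
import itertools
import numpy as np

def bound_for(X, y, w_hat, lam, q, delta=0.1):
    # tune the bound parameters (lam_t, q_t) over the ranges used in Section IV
    best = 1.0
    for q_t in np.linspace(0.1, 2.0, 8):
        for lam_t in np.logspace(-2, np.log10(200.0), 12):
            kl = kl_qp_upper_bound(w_hat, lam, q, lam_t, q_t)
            qs = QS_analytic(X, y, w_hat, lam_t, q_t)
            best = min(best, pac_bayes_bound(qs, kl, n=X.shape[0], delta=delta))
    return best

def select_by_bound(X, y, lams=(0.01, 0.1, 1.0, 5.0, 20.0), qs=(0.1, 0.2, 0.5, 0.7, 1.0)):
    results = {}
    for lam, q in itertools.product(lams, qs):
        w_hat = fit_lq_logistic(X, y, lam=lam, q=q)
        results[(lam, q)] = bound_for(X, y, w_hat, lam, q)
    return min(results, key=results.get), results

if __name__ == "__main__":
    X, y = make_dataset(m=20, n=500, setting="one_relevant", rng=0)
    best, _ = select_by_bound(X, y)
    print("(lam, q) minimizing the PAC-Bayes bound:", best)
```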


[Figure panels here: validation and test logloss versus data dimensions (200–1000) for q = 0.1, 0.3, 0.5, and 0.7, together with summary plots versus q (including the "exp decay relevance" setting) corresponding to the Fig. 5 caption below.]

Fig. 5. Empirical behavior summarizing the experiments with λ set by cross-validation. Test errors refer to counts out of 100. The error bars represent one standard error over all independent trials with dimensionality ranging from 10 to 1000. We see that L0 regularization performs better than L1; however, Lq