Norm-Based Capacity Control in Neural Networks
Behnam Neyshabur, Ryota Tomioka, Nathan Srebro
Toyota Technological Institute at Chicago

Can we control the capacity of neural networks independently of the number of parameters?

For linear prediction we have both parameter-based capacity control (VC dimension) and norm-based capacity control. For feed-forward networks, parameter-based capacity is poly(#edges); can capacity be controlled independently of the number of parameters, and under what conditions?

Feed-forward neural network: a function f : ℝ^D → ℝ specified by a directed acyclic graph G(V, E) and weights w on its edges.
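
As a minimal illustrative sketch (not from the poster; all names are hypothetical, and a linear output unit is assumed), such a network can be evaluated by visiting the DAG in topological order, applying a ReLU at each hidden unit:

    def evaluate(graph, weights, x):
        # graph: nodes in topological order, each entry (node, [parent nodes]);
        #        input nodes have no parents (their ids index into x),
        #        and the last node is the output.
        # weights: dict mapping an edge (u, v) to w(u -> v).
        value = {}
        output = graph[-1][0]
        for v, parents in graph:
            if not parents:
                value[v] = x[v]                                   # input unit reads x
            else:
                z = sum(weights[(u, v)] * value[u] for u in parents)
                value[v] = z if v == output else max(z, 0.0)      # ReLU at hidden units
        return value[output]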

Group norm regularization

    \mu_{p,q}(w) \;=\; \left( \sum_{v \in V} \Big( \sum_{(u \to v) \in E} |w(u \to v)|^p \Big)^{q/p} \right)^{1/q}

The outer sum runs over all nodes (units) v; the inner sum runs over all edges incoming to v.

Complexity of a function f:

    \mu_{p,q}(f) \;=\; \inf_{f_{G,w} = f} \mu_{p,q}(w)
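
As a concrete sketch (illustrative only; the function name and weight encoding are assumptions), the group norm can be computed directly from the edge weights:

    def group_norm(weights, p, q):
        # weights: dict mapping an edge (u, v) to w(u -> v)
        # mu_{p,q}(w) = ( sum_v ( sum_{(u -> v) in E} |w(u -> v)|^p )^{q/p} )^{1/q}
        incoming = {}
        for (u, v), w in weights.items():
            incoming.setdefault(v, []).append(abs(w))
        per_unit = [sum(a ** p for a in ws) ** (1.0 / p) for ws in incoming.values()]
        return sum(n ** q for n in per_unit) ** (1.0 / q)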

Rademacher complexity bound

For any depth d, 1 ≤ p ≤ 2, q ≥ 1, and S = {x_1, …, x_m} ⊆ ℝ^D,

    \mathcal{R}_m\!\left( \mathcal{N}^{d,H,\mathrm{ReLU}}_{\mu_{p,q} \le \mu} \right) \;\le\; \sqrt{ \frac{ \mu^{2d} \, \big( 2 H^{[1 - (1/p + 1/q)]_+} \big)^{2(d-1)} }{ m } }

ignoring terms polynomial in log(D) and max_i ‖x_i‖_{p*}. Here \mathcal{N}^{d,H,\mathrm{ReLU}}_{\mu_{p,q} \le \mu} is the class of functions that can be written as a ReLU network of depth d and width H with bounded μ_{p,q} norm.

•  Independent of the width H (maximum in-degree of a vertex) if 1/p + 1/q ≥ 1.
•  Exponential dependence on the depth d (length of the longest directed path).

Both H and d dependencies are tight! (see our paper)
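
A small numerical sketch (hypothetical helper, not from the paper) of the bound's leading term, showing the width-independence when 1/p + 1/q ≥ 1:

    import math

    def rademacher_bound(mu, d, H, p, q, m):
        # leading term: sqrt( mu^(2d) * (2 * H^([1 - (1/p + 1/q)]_+))^(2(d-1)) / m ),
        # omitting the log(D) and max_i ||x_i||_{p*} factors as in the statement above
        exponent = max(0.0, 1.0 - (1.0 / p + 1.0 / q))
        return math.sqrt(mu ** (2 * d) * (2 * H ** exponent) ** (2 * (d - 1)) / m)

    print(rademacher_bound(mu=1.0, d=3, H=100, p=1, q=1, m=10000))    # 0.04
    print(rademacher_bound(mu=1.0, d=3, H=10000, p=1, q=1, m=10000))  # 0.04 again: no H dependence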

Per-unit regularization (q = ∞)

•  For layered graphs, per-unit ℓ_p regularization is equivalent to path-based regularization with the regularizer

    \left( \sum_{v_{\mathrm{in}}[i] \xrightarrow{e_1} v_1 \xrightarrow{e_2} v_2 \,\cdots\, \xrightarrow{e_d} v_{\mathrm{out}}} \; \prod_{k=1}^{d} |w(e_k)|^p \right)^{1/p}

where the sum runs over every path from an input unit to the output unit.
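
For a layered network this sum need not be enumerated path by path: pushing a vector of ones through the entrywise p-th powers of the weight matrices computes the same quantity. A minimal sketch under that layered assumption (function and variable names are hypothetical):

    import numpy as np

    def path_regularizer(layers, p):
        # layers: list of weight matrices W_k of shape (n_{k+1}, n_k)
        # returns ( sum over input->output paths of prod_k |w(e_k)|^p )^(1/p)
        acc = np.ones(layers[0].shape[1])      # one unit of "path mass" per input unit
        for W in layers:
            acc = (np.abs(W) ** p) @ acc       # accumulate products of |w|^p along paths
        return float(acc.sum()) ** (1.0 / p)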

•  Path regularization is invariant to rescaling of the network ⇒ steepest descent with respect to the path regularizer is empirically more efficient than SGD and AdaGrad [new paper: arXiv:1506.02617]; see the sketch below.
•  Width-independent (H-independent) sample complexity only for p = 1.
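
A quick numerical check of the rescaling invariance (an illustrative two-layer example; the constant c and all variable names are assumptions): multiplying a hidden unit's incoming weights by c > 0 and its outgoing weight by 1/c leaves both the network function and the path regularizer unchanged.

    import numpy as np

    rng = np.random.default_rng(0)
    W1 = rng.normal(size=(5, 3))        # input -> hidden
    W2 = rng.normal(size=(1, 5))        # hidden -> output

    # rescale hidden unit 0: incoming weights * c, outgoing weight / c
    c = 10.0
    W1s, W2s = W1.copy(), W2.copy()
    W1s[0, :] *= c
    W2s[:, 0] /= c

    x = rng.normal(size=3)
    f  = W2  @ np.maximum(W1  @ x, 0.0)   # two-layer ReLU network
    fs = W2s @ np.maximum(W1s @ x, 0.0)
    print(np.allclose(f, fs))             # True: the function is unchanged

    def path_regularizer(layers, p=2):
        acc = np.ones(layers[0].shape[1])
        for W in layers:
            acc = (np.abs(W) ** p) @ acc
        return float(acc.sum()) ** (1.0 / p)

    print(np.isclose(path_regularizer([W1, W2]),
                     path_regularizer([W1s, W2s])))   # True: path regularizer unchanged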

Overall regularization (p = q)

•  Capacity independent of the width H for p = q ≤ 2.
•  Theorem:

    \mathcal{N}^{2,\mathrm{ReLU}}_{\mu_{2,2} \le 2} \;=\; \mathrm{conv}\big( \{\, x \mapsto \mathrm{ReLU}(\langle w, x \rangle) \;:\; \|w\|_2 \le 1 \,\} \big)

The left-hand side is the class of two-layer ReLU networks with bounded overall ℓ2 norm; the right-hand side is the convex hull of (infinitely many) rectified linear units.

That is, overall ℓ2 regularization is equivalent to Convex Neural Nets [Bengio 2005; Bach 2014], which use per-unit ℓ2 regularization on the first layer and ℓ1 regularization on the second layer.
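
A brief numerical sketch of the rebalancing argument behind this equivalence (illustrative only; the random example and names are assumptions): for a two-layer ReLU network f(x) = Σ_i v_i ReLU(⟨w_i, x⟩), the overall ℓ2 cost ½ μ_{2,2}(w)² is at least Σ_i |v_i| ‖w_i‖₂ (the convex-NN style cost), with equality after rescaling each unit so that ‖w_i‖₂ = |v_i|, a rescaling that does not change the function.

    import numpy as np

    rng = np.random.default_rng(1)
    W = rng.normal(size=(4, 3))    # first-layer weight vectors w_i (rows)
    v = rng.normal(size=4)         # second-layer weights v_i (assumed nonzero here)

    overall = 0.5 * (np.sum(W ** 2) + np.sum(v ** 2))          # 0.5 * mu_{2,2}(w)^2
    convex_nn = np.sum(np.abs(v) * np.linalg.norm(W, axis=1))  # sum_i |v_i| * ||w_i||_2

    print(overall >= convex_nn)    # True (AM-GM, unit by unit)

    # rebalance each unit: (w_i, v_i) -> (w_i * s_i, v_i / s_i) with s_i = sqrt(|v_i| / ||w_i||_2),
    # which leaves the ReLU network function unchanged and attains the lower bound
    s = np.sqrt(np.abs(v) / np.linalg.norm(W, axis=1))
    Wb, vb = W * s[:, None], v / s
    balanced = 0.5 * (np.sum(Wb ** 2) + np.sum(vb ** 2))
    print(np.isclose(balanced, convex_nn))   # True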

Conclusions

•  It is possible to control the complexity of neural networks independently of the number of parameters when p and q are not too large.
•  More results (please read our paper):
   –  Our bound is tight.
   –  The low-norm class is convex if 1/p + (d-1)/q ≤ 1.
   –  Hardness of optimization.