Norm-Based Capacity Control in Neural Networks
Behnam Neyshabur, Ryota Tomioka, Nathan Srebro
Toyota Technological Institute at Chicago
Can we control the capacity of NNs independently of the number of parameters?

[Comparison table: linear prediction vs. feed-forward networks, parameter-based capacity (VC dimension) vs. norm-based capacity. For feed-forward networks the parameter-based capacity is poly(#edges); is the norm-based capacity independent of #params, and under what conditions?]

Feed-forward neural network:
$f : \mathbb{R}^D \to \mathbb{R}$, specified by a directed acyclic graph $G(V,E)$ and weights $w$.
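As a concrete reading of this definition, here is a minimal sketch we added (the names `forward`, `nodes`, `weights` are ours, not from the poster), assuming ReLU activations at internal units and a linear output unit: the network is evaluated by visiting the DAG in topological order.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def forward(x, nodes, weights):
    """Evaluate f_{G,w}(x) for a ReLU network on a DAG.

    nodes   : node ids in topological order; the first len(x) ids are the
              input units and the last id is the single output unit.
    weights : dict mapping directed edges (u, v) -> w(u -> v).
    """
    value = {v: x[i] for i, v in enumerate(nodes[:len(x)])}   # input units read x
    for v in nodes[len(x):]:
        # weighted sum over the incoming edges of v
        pre = sum(w * value[u] for (u, vv), w in weights.items() if vv == v)
        value[v] = pre if v == nodes[-1] else relu(pre)       # output unit is linear
    return value[nodes[-1]]

# tiny example: two inputs, one hidden unit, one output
print(forward(np.array([1.0, -2.0]),
              nodes=["x1", "x2", "h", "out"],
              weights={("x1", "h"): 1.0, ("x2", "h"): -0.5, ("h", "out"): 2.0}))
```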
Group norm regularization

$$\mu_{p,q}(w) = \left( \sum_{v \in V} \Bigl( \sum_{(u \to v) \in E} |w(u \to v)|^p \Bigr)^{q/p} \right)^{1/q}$$

(the outer sum runs over all the nodes (units); the inner sum runs over all the incoming edges to $v$)

Complexity of a function f:

$$\mu_{p,q}(f) = \inf_{f_{G,w} = f} \mu_{p,q}(w)$$
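A small sketch of how $\mu_{p,q}(w)$ can be computed from an edge-weight map (an illustration we added; the helper name `mu_pq` is ours): group the weights by the unit they feed into, take the $\ell_p$ norm within each group, then the $\ell_q$ norm across groups.

```python
import numpy as np

def mu_pq(weights, p, q):
    """Group norm mu_{p,q}(w): l_q norm over units of the l_p norm of each
    unit's incoming weights.  `weights` maps edges (u, v) -> w(u -> v)."""
    incoming = {}
    for (u, v), w in weights.items():
        incoming.setdefault(v, []).append(abs(w))                 # group by head unit v
    per_unit = [np.sum(np.asarray(g) ** p) ** (1.0 / p)           # l_p over incoming edges
                for g in incoming.values()]
    return float(np.sum(np.asarray(per_unit) ** q) ** (1.0 / q))  # l_q over units
```

For a fully connected layer with weight matrix W (one row per unit), this reduces to the $\ell_q$ norm of the row-wise $\ell_p$ norms.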
Rademacher complexity bound

For any depth d, 1 ≤ p ≤ 2, q ≥ 1, and $S = \{x_1, \dots, x_m\} \subseteq \mathbb{R}^D$,

$$\mathcal{R}_m\!\left(\mathcal{N}^{d,H,\mathrm{RELU}}_{\mu_{p,q} \le \mu}\right) \;\le\; \sqrt{\frac{\mu^{2d}\,\bigl(2 H^{[1-(1/p+1/q)]_+}\bigr)^{2(d-1)}}{m}}$$

ignoring terms polynomial in $\log(D)$ and $\max_i \|x_i\|_{p^*}$. Here $\mathcal{N}^{d,H,\mathrm{RELU}}_{\mu_{p,q} \le \mu}$ is the class of functions that can be written as a ReLU network with bounded $\mu_{p,q}$ norm.

• Independent of width H (max in-degree of a vertex) if 1/p + 1/q ≥ 1.
• Exponential dependence on depth d (length of the longest directed path).

Both H and d dependencies are tight! (see our paper)
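To see the two bullets above numerically, here is a small evaluation (our own helper, not from the poster) of the leading term of the bound, ignoring the $\log(D)$ and $\max_i \|x_i\|_{p^*}$ factors:

```python
import numpy as np

def capacity_bound(mu, d, H, p, q, m):
    """Leading term of the Rademacher complexity bound above."""
    exponent = max(0.0, 1.0 - (1.0 / p + 1.0 / q))          # [1 - (1/p + 1/q)]_+
    return np.sqrt(mu ** (2 * d) * (2.0 * H ** exponent) ** (2 * (d - 1)) / m)

# For 1/p + 1/q >= 1 the H-exponent vanishes, so width drops out of the bound
# (while the 2^(d-1) factor keeps the exponential dependence on depth):
print(capacity_bound(mu=2.0, d=3, H=100,   p=1, q=2, m=10_000))
print(capacity_bound(mu=2.0, d=3, H=10**6, p=1, q=2, m=10_000))   # identical
print(capacity_bound(mu=2.0, d=3, H=10**6, p=2, q=4, m=10_000))   # 1/p + 1/q < 1: grows with H
```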
Per-unit regularization (q=∞)

• For layered graphs, per-unit Lp regularization is equivalent to path-based regularization:

$$\phi_p(w) = \left( \sum_{v_{\mathrm{in}}[i] \xrightarrow{e_1} v_1 \xrightarrow{e_2} v_2 \cdots \xrightarrow{e_d} v_{\mathrm{out}}} \; \prod_{k=1}^{d} |w(e_k)|^p \right)^{1/p}$$

(the sum runs over every path from an input unit to the output)
• Path regularization is invariant to rescaling of the network ⇒ steepest descent w.r.t. the path regularizer is empirically more efficient than SGD and AdaGrad [new paper: arXiv 1506.02617] (see the sketch below).
• H (width)-independent sample complexity only for p = 1.
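A minimal sketch we added for a fully connected layered network (the names `path_reg`, `W1`, `W2` are ours). It shows two things: the path regularizer can be accumulated layer by layer as a chain of products of the elementwise $|W_k|^p$ matrices, and it is unchanged when a hidden unit's incoming weights are scaled by $c$ and its outgoing weights by $1/c$, a rescaling that, by positive homogeneity of the ReLU, also leaves the network function unchanged, while the overall $\mu_{2,2}$ norm does change.

```python
import numpy as np

def path_reg(weight_mats, p):
    """Path regularizer phi_p(w) for a layered network.

    weight_mats = [W_1, ..., W_d], where W_k maps layer k-1 to layer k and the
    last layer has a single output unit.  Summing prod_k |w(e_k)|^p over every
    input-to-output path is a chain of products of the elementwise |W_k|^p.
    """
    acc = np.abs(weight_mats[0]) ** p
    for W in weight_mats[1:]:
        acc = (np.abs(W) ** p) @ acc                 # accumulate paths layer by layer
    return float(np.sum(acc)) ** (1.0 / p)           # sum over input units, take 1/p power

rng = np.random.default_rng(0)
W1 = rng.standard_normal((5, 4))                     # 4 inputs -> 5 hidden units
W2 = rng.standard_normal((1, 5))                     # 5 hidden units -> 1 output

# Rescale hidden unit j: incoming weights times c_j, outgoing weights times 1/c_j.
# The ReLU is positively homogeneous, so the network function is unchanged.
c = np.array([10.0, 0.1, 3.0, 1.0, 0.5])
W1r, W2r = c[:, None] * W1, W2 / c[None, :]

print(path_reg([W1, W2], p=2), path_reg([W1r, W2r], p=2))   # identical: rescaling-invariant
print(np.sqrt(np.sum(W1**2) + np.sum(W2**2)),                # overall mu_{2,2} norm changes
      np.sqrt(np.sum(W1r**2) + np.sum(W2r**2)))
```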
Overall regularization (p=q)

• Capacity independent of width H for p = q ≤ 2.
• Theorem:

$$\mathcal{N}^{2,\mathrm{RELU}}_{\mu_{2,2} \le 2} = \mathrm{conv}\bigl(\{\, x \mapsto \mathrm{RELU}(\langle w, x \rangle) : \|w\|_2 \le 1 \,\}\bigr)$$

The left-hand side is the class of two-layer ReLU networks with bounded overall L2 norm; the right-hand side is the convex hull of (infinitely many) rectified linear units.

That is, overall L2 regularization is equivalent to Convex Neural Nets [Bengio 2005; Bach 2014], which use per-unit L2 regularization for the first layer and L1 regularization for the second layer.
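A quick numeric check we added of the rescaling argument behind this equivalence (the names `U`, `v` are ours): for a two-layer ReLU network, the squared overall L2 norm $\sum_j (\|u_j\|_2^2 + v_j^2)$ can always be driven down, without changing the function, to $2\sum_j |v_j|\,\|u_j\|_2$, i.e. to per-unit L2 on the first layer combined with L1 on the second layer.

```python
import numpy as np
rng = np.random.default_rng(1)

# Two-layer ReLU net: output(x) = sum_j v_j * RELU(<u_j, x>).
U = rng.standard_normal((6, 4))      # rows u_j: first-layer (per-unit) weights
v = rng.standard_normal(6)           # second-layer weights

# Squared overall L2 norm: mu_{2,2}(w)^2 = sum_j (||u_j||^2 + v_j^2).
overall_sq = np.sum(U ** 2) + np.sum(v ** 2)

# Rescaling unit j (u_j -> c_j u_j, v_j -> v_j / c_j) leaves the function
# unchanged; the balancing choice c_j = sqrt(|v_j| / ||u_j||) gives
# 2 * sum_j |v_j| * ||u_j||  --  an L1 norm over units of per-unit L2 scales.
balanced_sq = 2.0 * np.sum(np.abs(v) * np.linalg.norm(U, axis=1))

# AM-GM: ||u_j||^2 + v_j^2 >= 2 |v_j| ||u_j||, with equality after balancing.
print(overall_sq, ">=", balanced_sq)
```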
Conclusions

• It is possible to control the complexity of NNs independently of the number of parameters for p and q not too large.
• More results (please read our paper):
  – Our bound is tight.
  – The low-norm class is convex if 1/p + (d-1)/q ≤ 1.
  – Hardness of optimization.