arXiv:1512.01834v1 [cond-mat.dis-nn] 6 Dec 2015

Classification of Manifolds by Single-Layer Neural Networks

SueYeon Chung,1,2 Daniel D. Lee,3 and Haim Sompolinsky2,4,5,∗

1 Program in Applied Physics, School of Engineering and Applied Sciences, Harvard University, Cambridge, MA 02138, USA
2 Center for Brain Science, Harvard University, Cambridge, MA 02138, USA
3 Department of Electrical and Systems Engineering, University of Pennsylvania, Philadelphia, PA 19104, USA
4 Racah Institute of Physics, Hebrew University, Jerusalem 91904, Israel
5 Edmond and Lily Safra Center for Brain Sciences, Hebrew University, Jerusalem 91904, Israel

The neuronal representation of objects exhibits enormous variability due to changes in the object's physical features such as location, size, orientation, and intensity. How the brain copes with the variability across these manifolds of neuronal states and generates invariant perception of objects remains poorly understood. Here we present a theory of neuronal classification of manifolds, extending Gardner's replica theory of the classification of isolated points by a single-layer perceptron. We evaluate how the perceptron capacity depends on the dimensionality, size, and shape of the classified manifolds.

PACS numbers: 87.18.Sn, 87.19.lt, 87.19.lv
A fundamental neural model of classification is the perceptron. The perceptron thresholds a linear weighted sum of its input components, giving rise to a decision hyperplane that is orthogonal to the weight vector [1, 2]. This model of linear classification is the basis of many modern machine learning architectures, including the well-known support vector machines [3]. A theoretical understanding of the perceptron was pioneered by Elizabeth Gardner, who formulated it as a statistical mechanics problem and analyzed it using replica theory [2, 4–6]. Perceptron theory considers the classification of a finite set of input vectors, each corresponding to a point in neuronal state space. However, high-level perception involves classifying or identifying objects, each represented by a continuum of neuronal states [7, 8]. These perceptual manifolds arise from different physical instantiations of the objects, including variations in intensity, location, scale, and orientation. Coping with this variability is one of the major tasks of the hierarchy of sensory systems. Recent experiments have shown that object representations in the highest level of the visual hierarchy, the inferotemporal (IT) cortex, are amenable to simple linear readout, in contrast to earlier stages of the visual hierarchy [7–10]. Similar observations apply to object representations in the hierarchy of artificial deep neural networks for object recognition [11–14]. These observations raise the question of what properties of neuronal perceptual manifolds are consistent with processing by simple downstream classifiers such as the perceptron. In this work, we generalize Gardner's statistical mechanical analysis and establish a theory of linear classification of manifolds [2, 4], synthesizing statistical and geometric properties of high-dimensional signals. We apply the theory to simple classes of manifolds: line segments, low-dimensional balls, and two-dimensional $L_1$ balls. We show how the
maximum number of manifolds that can be classified by a perceptron and the nature of the resulting solution depend on the manifolds' dimensionality, size, and shape.

Line segments: We begin with the problem of classifying a set of $P$ line segments in $N$ dimensions, all having length $2R$ (Fig. 1). These segments can be written as $\{x^\mu + R s u^\mu\}$, $-1 \le s \le 1$, $\mu = 1, \dots, P$. The $N$-dimensional vectors $x^\mu \in \mathbb{R}^N$ and $u^\mu \in \mathbb{R}^N$ denote, respectively, the centers and directions of the $\mu$-th segment, and the scalar $s$ parameterizes the continuum of points along the segment. The parameter $R$ measures the extent of the segments relative to the distance between the centers. We seek to partition the different line segments into two classes defined by binary labels $y^\mu = \pm 1$. To classify the segments, a weight vector $w \in \mathbb{R}^N$ must obey $y^\mu w \cdot (x^\mu + R s u^\mu) \ge \kappa$ for all $\mu$ and $s$. The parameter $\kappa \ge 0$ is known as the margin; in general, a larger $\kappa$ indicates that the perceptron solution will be more robust to noise and display better generalization properties [3]. Hence, we are interested in maximum margin solutions, i.e., weight vectors $w$ that yield the maximum possible value of $\kappa$. Since line segments are convex, only the endpoints of each segment need to be checked, namely $\min_{\pm}\, h_0^\mu \pm R h^\mu = h_0^\mu - R|h^\mu| \ge \kappa$, where $h_0^\mu = N^{-1/2} y^\mu w \cdot x^\mu$ are the fields induced by the centers and $h^\mu = N^{-1/2} y^\mu w \cdot u^\mu$ are the fields induced by the line directions (with the normalization $\|w\| = \sqrt{N}$, see Eq. (1) below).
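Since only the endpoints matter, checking whether a given set of segments is separable reduces to a finite set of linear constraints. The short Python sketch below is our own illustration, not part of the paper; the function name and the use of scipy's linear-programming solver are our choices. It tests zero-margin separability of random segments and can be used to estimate the capacity empirically, in the spirit of the simulations reported in Fig. 1.

```python
# A minimal simulation sketch (ours): a set of line segments is separable with margin
# kappa iff the constraints hold at both endpoints, so separability can be posed as a
# linear-programming feasibility problem.
import numpy as np
from scipy.optimize import linprog

def segments_separable(X, U, y, R):
    """Check whether some w satisfies y^mu w.(x^mu +/- R u^mu) >= 1 for all mu.
    By rescaling of w, this is equivalent to strict separability at zero margin."""
    P, N = X.shape
    endpoints = np.vstack([X + R * U, X - R * U])      # 2P endpoint vectors
    labels = np.concatenate([y, y])
    A_ub = -(labels[:, None] * endpoints)               # -y w.endpoint <= -1
    res = linprog(np.zeros(N), A_ub=A_ub, b_ub=-np.ones(2 * P),
                  bounds=[(None, None)] * N, method="highs")
    return res.status == 0                               # 0 = feasible, 2 = infeasible

rng = np.random.default_rng(0)
N, R, trials = 100, 1.0, 20
for alpha in [0.8, 1.0, 1.2]:
    P = int(alpha * N)
    frac = np.mean([segments_separable(rng.standard_normal((P, N)),
                                       rng.standard_normal((P, N)),
                                       rng.choice([-1.0, 1.0], size=P), R)
                    for _ in range(trials)])
    print(f"alpha = {alpha:.1f}: separable fraction = {frac:.2f}")
# The theory below gives alpha_1(0, 1) = 1, so for large N the separable fraction
# should cross 1/2 near alpha = 1 when R = 1.
```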
Replica theory: The existence of a weight vector $w$ that can successfully classify the line segments depends upon the statistics of the segments. We consider random line segments where the components of $x^\mu$ and $u^\mu$ are i.i.d. normally distributed with zero mean and unit variance, and random binary labels $y^\mu$. We study the thermodynamic limit where the dimensionality $N \to \infty$ and the number of segments $P \to \infty$ with finite $\alpha = P/N$ and $R$. Following Gardner [4], we compute the average log volume $\langle \ln V \rangle$, where $V$ is the volume of the space of perceptron solutions:

$$V = \int_{\|w\|^2 = N} d^N w \, \prod_{\mu=1}^{P} \Theta\!\left(h_0^\mu - R|h^\mu| - \kappa\right). \qquad (1)$$

$\Theta(x)$ is the Heaviside step function. According to replica theory, the fields are described as sums of random Gaussian fields, $h_0^\mu = t_0^\mu + z_0^\mu$ and $h^\mu = t^\mu + z^\mu$, where $t_0$ and $t$ are quenched components arising from fluctuations in the input vectors $x^\mu$ and $u^\mu$ respectively, and the $z_0$, $z$ fields represent the variability in $h_0^\mu$ and $h^\mu$ resulting from different solutions of $w$. These fields must obey the constraint $z_0 + t_0 - R|z + t| \ge \kappa$. The capacity function $\alpha_1(\kappa, R)$ describes the $P/N$ ratio at which the perceptron solution volume shrinks to a unique weight vector. The reciprocal of the capacity is given by the replica symmetric calculation (details provided in the supplementary materials, SM):

$$\alpha_1^{-1}(\kappa, R) = \left\langle \min_{z_0 + t_0 - R|z+t| \ge \kappa} \frac{1}{2}\left[z_0^2 + z^2\right] \right\rangle_{t_0,\, t} \qquad (2)$$

where the average is over the Gaussian statistics of $t_0$ and $t$. To compute Eq. (2), three regimes need to be considered. First, when $t_0$ is large enough that $t_0 > \kappa + R|t|$, the minimum occurs at $z_0 = z = 0$, which does not contribute to the capacity. In this regime, $h_0^\mu > \kappa$ and $h^\mu > 0$, implying that neither of the two segment endpoints reaches the margin. In the other extreme, when $t_0 < \kappa - R^{-1}|t|$, the minimum is given by $z_0 = \kappa - t_0$ and $z = -|t|$, i.e., $h_0^\mu = \kappa$ and $h^\mu = 0$, indicating that both endpoints of the line segment lie on the margin planes. In the intermediate regime, where $\kappa - R^{-1}|t| < t_0 < \kappa + R|t|$, the constraint is saturated with $z_0 = R|z + t| + \kappa - t_0$ but $z > -|t|$, i.e., $h_0^\mu - R|h^\mu| = \kappa$ with $h^\mu > 0$, corresponding to only one of the line segment endpoints touching the margin. In this regime, the solution is given by minimizing the function $(R|z + t| + \kappa - t_0)^2 + z^2$ with respect to $z$. Combining these contributions, we can write the perceptron capacity of line segments:

$$\alpha_1^{-1}(\kappa, R) = \int_{-\infty}^{\infty} Dt \int_{\kappa - R^{-1}|t|}^{\kappa + R|t|} Dt_0 \, \frac{\left(R|t| + \kappa - t_0\right)^2}{R^2 + 1} + \int_{-\infty}^{\infty} Dt \int_{-\infty}^{\kappa - R^{-1}|t|} Dt_0 \left[(\kappa - t_0)^2 + t^2\right] \qquad (3)$$

with integrations over the Gaussian measure, $Dx \equiv \frac{1}{\sqrt{2\pi}} e^{-\frac{1}{2}x^2} dx$. It is instructive to consider special limits. When $R \to 0$, Eq. (3) reduces to $\alpha_1(\kappa, 0) = \alpha_0(\kappa)$, where $\alpha_0(\kappa)$ is Gardner's original capacity result for perceptrons classifying $P$ points (zero-dimensional manifolds) with margin $\kappa$. Interestingly, when $R = 1$, $\alpha_1(\kappa, 1) = \frac{1}{2}\alpha_0(\kappa/\sqrt{2})$. This is because when $R = 1$ there are no statistical correlations between the line segment endpoints, and the problem becomes equivalent to classifying $2P$ random points with average norm $\sqrt{2N}$. Finally, when $R \to \infty$, the capacity is further reduced: $\alpha_1^{-1}(\kappa, \infty) = \alpha_0^{-1}(\kappa) + 1$. This is because when $R$ is large, the segments become unbounded lines, and the only solution is for $w$ to be orthogonal to all $P$ line directions. The problem is then equivalent to classifying $P$ center points in the $(N - P)$-dimensional null space of the line directions, so that at capacity $P = \alpha_0(\kappa)(N - P)$. We see this most simply at zero margin, $\kappa = 0$, where Eq. (3) reduces to a simple analytic expression for the capacity: $\alpha_1^{-1}(0, R) = \frac{1}{2} + \frac{2}{\pi}\arctan R$ (SM). The capacity decreases from $\alpha_1(0, R = 0) = 2$ to $\alpha_1(0, R = 1) = 1$ and $\alpha_1(0, R = \infty) = \frac{2}{3}$ for unbounded lines.

FIG. 1: (a) Linear classification of points: (solid) points on the margin, (striped) interior points. (b) Linear classification of line segments: (solid) lines embedded in the margin, (dotted) lines touching the margin, (striped) interior lines. (c) Capacity $\alpha = P/N$ of a network with $N = 200$ as a function of $R$, with margins $\kappa = 0$ (red) and $\kappa = 0.5$ (blue). Theoretical predictions (lines) and numerical simulations (markers, see SM for details) are shown. (d) Fraction of different line configurations at capacity with $\kappa = 0$: (red) lines in the margin, (blue) lines touching the margin, (black) interior lines.

We have also calculated analytically the distribution of the center and direction fields $h_0^\mu$ and $h^\mu$ [15]. The distribution consists of three contributions, corresponding to the regimes that determine the capacity. One component corresponds to line segments fully embedded in the margin planes; the fraction of these manifolds is simply the volume of phase space of $t$ and $t_0$ in the last term of Eq. (3). Another fraction, given by the volume of phase space in the first integral of Eq. (3), corresponds to line segments touching the margin planes at only one endpoint. The remainder of the manifolds are those interior to the margin planes. Fig. 1 shows that our theoretical calculations correspond nicely with our numerical simulations for the perceptron capacity of line segments, even with modest input dimensionality $N = 200$. Note that as $R \to \infty$, half of the manifolds lie in the margin planes while half only touch them; however, the angles between these segments and the margin planes approach zero in this limit.
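As a consistency check of Eq. (3), the following sketch (ours; all function names are illustrative) evaluates the two integrals numerically at $\kappa = 0$ and compares the result with the closed form $\alpha_1^{-1}(0, R) = \frac{1}{2} + \frac{2}{\pi}\arctan R$ quoted above, recovering the limiting capacities 2, 1, and 2/3 for small, unit, and large $R$.

```python
# A numerical sketch (ours) that evaluates Eq. (3) at kappa = 0 and checks it against
# the closed form alpha_1^{-1}(0, R) = 1/2 + (2/pi) arctan(R) quoted in the text.
import numpy as np
from scipy.integrate import dblquad
from scipy.stats import norm

def alpha1_inv(kappa, R):
    """Inverse capacity of line segments, Eq. (3); Dt is the standard Gaussian measure."""
    phi = norm.pdf
    # intermediate regime: one endpoint touches the margin plane
    touch = dblquad(lambda t0, t: phi(t) * phi(t0) * (R*abs(t) + kappa - t0)**2 / (R**2 + 1),
                    -np.inf, np.inf,
                    lambda t: kappa - abs(t)/R, lambda t: kappa + R*abs(t))[0]
    # fully embedded regime: both endpoints lie on the margin plane
    embed = dblquad(lambda t0, t: phi(t) * phi(t0) * ((kappa - t0)**2 + t**2),
                    -np.inf, np.inf,
                    -np.inf, lambda t: kappa - abs(t)/R)[0]
    return touch + embed

for R in [0.1, 1.0, 10.0]:
    closed_form = 0.5 + (2/np.pi) * np.arctan(R)
    print(f"R = {R:5.1f}:  Eq.(3) -> alpha_1 = {1/alpha1_inv(0.0, R):.3f},  "
          f"closed form -> {1/closed_form:.3f}")
# Expected: alpha_1 -> 2 as R -> 0, alpha_1 = 1 at R = 1, alpha_1 -> 2/3 as R -> infinity.
```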
D-dimensional balls: We now consider linear classification of $D$-dimensional balls embedded in $N$ dimensions, $\{x^\mu + R\sum_{i=1}^{D} s_i u_i^\mu\}$, so that the $\mu$-th manifold is centered at the vector $x^\mu \in \mathbb{R}^N$ and its extent is described by a set of $D$ basis vectors $u_i^\mu \in \mathbb{R}^N$, $i = 1, \dots, D$. The points in each manifold are parameterized by the $D$-dimensional vector $\vec{s} \in \mathbb{R}^D$, whose Euclidean norm is constrained by $\|\vec{s}\| \le 1$. Statistically, all components of $x^\mu$ and $u_i^\mu$ are i.i.d. Gaussian random variables with zero mean and unit variance. We define $h_0^\mu = N^{-1/2} y^\mu w \cdot x^\mu$ as the field induced by the manifold centers and $h_i^\mu = N^{-1/2} y^\mu w \cdot u_i^\mu$ as the $D$ fields induced by the basis vectors, with normalization $\|w\| = \sqrt{N}$. To classify all the points on the manifolds correctly with margin $\kappa$, $w \in \mathbb{R}^N$ must satisfy the inequality $h_0^\mu - R\|\vec{h}^\mu\| \ge \kappa$, where $\|\vec{h}^\mu\|$ is the Euclidean norm of the $D$-dimensional vector $\vec{h}^\mu$ whose components are $h_i^\mu$. This corresponds to the requirement that the field induced by the point on the $\mu$-th manifold with the smallest projection on $w$ be larger than the margin $\kappa$.
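The worst-case field over a ball manifold has a simple closed form: the minimizing point is $\vec{s}^* = -\vec{h}^\mu/\|\vec{h}^\mu\|$, giving $h_0^\mu - R\|\vec{h}^\mu\|$. The snippet below (our illustration; the variable names mirror the text but the helper function is hypothetical) computes this worst-case margin and verifies it against brute-force sampling of the ball.

```python
# A small illustrative sketch (ours): for a ball manifold {x + R*U^T s, ||s|| <= 1}, the
# smallest signed field over the manifold equals h0 - R*||h_vec||, attained at
# s* = -h_vec / ||h_vec||.
import numpy as np

def ball_worst_margin(w, x, U, R, y):
    """Return the smallest field y w.(x + R U^T s)/sqrt(N) over the ball, and the minimizer s."""
    N = w.size
    h0 = y * (w @ x) / np.sqrt(N)             # field induced by the center
    h_vec = y * (U @ w) / np.sqrt(N)          # D fields induced by the basis vectors u_i
    s_star = -h_vec / np.linalg.norm(h_vec)   # worst point on the ball boundary
    return h0 - R * np.linalg.norm(h_vec), s_star

rng = np.random.default_rng(1)
N, D, R = 500, 5, 0.3
w = rng.standard_normal(N); w *= np.sqrt(N) / np.linalg.norm(w)   # ||w|| = sqrt(N)
x, U, y = rng.standard_normal(N), rng.standard_normal((D, N)), 1.0
worst, s_star = ball_worst_margin(w, x, U, R, y)

# brute-force check: sample many points s with ||s|| <= 1 and compare
s = rng.standard_normal((10000, D))
s /= np.maximum(1.0, np.linalg.norm(s, axis=1, keepdims=True))
fields = y * ((x + R * s @ U) @ w) / np.sqrt(N)
print(worst, fields.min())   # the analytic worst case lower-bounds the sampled fields
```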
FIG. 2: Random $D$-dimensional balls: (a) Linear classification of $D = 2$ balls. (b) Fraction of 2-D ball configurations as a function of $R$ at capacity with $\kappa = 0$, comparing theory (lines) with simulations (markers): (red) balls embedded in the plane, (blue) balls touching the plane, (black) interior balls. (c) Linear classification of balls with $D = N$ at margin $\kappa$ (black circles) is equivalent to point classification of the centers with effective margin $\kappa + R\sqrt{N}$ (purple points). (d) Capacity $\alpha = P/N$ for $\kappa = 0$, for large $D = 50$ and $R \propto D^{-1/2}$, as a function of $R\sqrt{D}$: (blue solid) $\alpha_D(0, R)$ compared with $\alpha_0(R\sqrt{D})$ (red squares). (Inset) Capacity $\alpha$ at $\kappa = 0$ for $0.35 \le R \le 20$ and $D = 20$: (blue) theoretical $\alpha$ compared with the approximate form $(1 + R^{-2})/D$ (red dashed).
We solve the replica theory in the limit of $N, P \to \infty$ with finite $\alpha = P/N$, $D$, and $R$. The fields for each of the manifolds can be written as sums of Gaussian quenched and entropic components, $t_0 \in \mathbb{R}$, $\vec{t} \in \mathbb{R}^D$ and $z_0 \in \mathbb{R}$, $\vec{z} \in \mathbb{R}^D$, respectively. The capacity for $D$-dimensional manifolds is given by the replica symmetric calculation (SM):

$$\alpha_D^{-1}(\kappa, R) = \left\langle \min_{t_0 + z_0 - R\|\vec{t} + \vec{z}\| \ge \kappa} \frac{1}{2}\left[z_0^2 + \|\vec{z}\|^2\right] \right\rangle_{t_0,\, \vec{t}}. \qquad (4)$$

The capacity calculation can be partitioned into three regimes. For large $t_0 > \kappa + Rt$, where $t = \|\vec{t}\|$, $z_0 = 0$ and $\vec{z} = 0$, corresponding to manifolds which lie interior to the margin planes of the perceptron. On the other hand, when $t_0 < \kappa - R^{-1}t$, the minimum is obtained at $z_0 = \kappa - t_0$ and $\vec{z} = -\vec{t}$, corresponding to manifolds which are fully embedded in the margin planes. Finally, in the intermediate regime, when $\kappa - R^{-1}t < t_0 < \kappa + Rt$, $z_0 = R\|\vec{t} + \vec{z}\| - t_0 + \kappa$ but $\vec{z} \neq -\vec{t}$, indicating that these manifolds only touch the margin planes. Decomposing the capacity over these regimes and integrating out the angular components, the capacity of the perceptron can be written as:

$$\alpha_D^{-1}(\kappa, R) = \int_0^{\infty} dt\, \chi_D(t) \int_{\kappa - R^{-1}t}^{\kappa + Rt} Dt_0 \, \frac{\left(Rt + \kappa - t_0\right)^2}{R^2 + 1} + \int_0^{\infty} dt\, \chi_D(t) \int_{-\infty}^{\kappa - R^{-1}t} Dt_0 \left[(\kappa - t_0)^2 + t^2\right] \qquad (5)$$

where $\chi_D(t) = \frac{2^{1 - D/2}}{\Gamma(D/2)}\, t^{D-1} e^{-\frac{1}{2}t^2}$ is the Chi probability density function for radius $t$ in $D$ dimensions. As $R \to \infty$, Eq. (5) reduces to $\alpha_D^{-1}(\kappa, \infty) = \alpha_0^{-1}(\kappa) + D$, which indicates that $w$ must be in the null space of the $PD$ basis vectors $\{u_i^\mu\}$ in this limit. This case is equivalent to the classification of $P$ points (the projections of the manifold centers) by a perceptron in the $(N - PD)$-dimensional null space.

To probe the fields, we consider the joint distribution of the field induced by the center, $h_0$, and the norm of the fields induced by the manifold directions, $h \equiv \|\vec{h}\|$. Corresponding to the capacity calculation, there are three contributions. The first term corresponds to $h_0 - Rh > \kappa$, i.e., balls that lie interior to the perceptron margin planes; the second component corresponds to $h_0 - Rh = \kappa$ but $h > 0$, i.e., balls that touch the margin planes; and the third contribution represents the fraction of balls obeying $h_0 = \kappa$ and $h = 0$, i.e., balls fully embedded in the margin. The dependence of these contributions on $R$ for $D = 2$ is shown in Fig. 2(b).
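For concreteness, the following sketch (ours) evaluates Eq. (5) numerically using the chi density for the radial quenched field and checks two statements from the text: for $D = 1$ it reproduces the line-segment result, and for large $R$ it approaches $\alpha_0^{-1}(\kappa) + D$, with $\alpha_0$ computed from Gardner's point formula.

```python
# A numerical sketch (ours) of Eq. (5): inverse capacity of D-dimensional balls.
import numpy as np
from scipy.integrate import dblquad, quad
from scipy.stats import chi, norm

def alphaD_inv(kappa, R, D):
    phi = norm.pdf
    rho = chi(D).pdf                                   # chi density for t = ||t_vec||
    touch = dblquad(lambda t0, t: rho(t) * phi(t0) * (R*t + kappa - t0)**2 / (R**2 + 1),
                    0, np.inf, lambda t: kappa - t/R, lambda t: kappa + R*t)[0]
    embed = dblquad(lambda t0, t: rho(t) * phi(t0) * ((kappa - t0)**2 + t**2),
                    0, np.inf, -np.inf, lambda t: kappa - t/R)[0]
    return touch + embed

def alpha0_inv(kappa):
    """Gardner's inverse point capacity: integral of (kappa - t0)^2 over t0 < kappa."""
    return quad(lambda t0: norm.pdf(t0) * (kappa - t0)**2, -np.inf, kappa)[0]

kappa = 0.0
# D = 1 reproduces the line-segment result (= 1 at kappa = 0, R = 1)
print(alphaD_inv(kappa, R=1.0, D=1), 0.5 + (2/np.pi) * np.arctan(1.0))
# large R approaches alpha_0^{-1}(kappa) + D (here D = 3)
print(alphaD_inv(kappa, R=50.0, D=3), alpha0_inv(kappa) + 3)
```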
In a number of realistic problems, the dimensionality $D$ of the object manifolds could be quite large. Hence, we analyze the limit $D \gg 1$. In this situation, for the capacity to remain finite, $R$ has to be small: $R \propto D^{-1/2}$. In this regime, we find that the capacity is $\alpha_D(\kappa, R) \approx \alpha_0(\kappa + R\sqrt{D})$. In other words, the problem of separating $P$ high-dimensional balls with margin $\kappa$ is equivalent to separating $P$ points but with a margin $\kappa + R\sqrt{D}$. This is because when the distance of the closest point on the $D$-dimensional ball to the margin plane is $\kappa$, the distance of the center is $\kappa + R\sqrt{D}$ (see Fig. 2). In particular, when the balls are full rank, i.e., $D = N$-dimensional spheres, then $R$ must be of order $1/\sqrt{N}$ so that the radius of the balls, $R\sqrt{N}$, remains finite. When $R$ is larger, the capacity vanishes as $\alpha_D(0, R) \approx (1 + R^{-2})/D$.
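The effective-margin picture can be illustrated directly: with $\|w\| = \sqrt{N}$ and i.i.d. Gaussian basis vectors, the direction fields $h_i$ are approximately standard normal, so $\|\vec{h}\|$ concentrates around $\sqrt{D}$. The short check below (ours) makes this concentration explicit.

```python
# A quick check (ours) of the geometric intuition behind alpha_D ~ alpha_0(kappa + R*sqrt(D)):
# for ||w|| = sqrt(N) and i.i.d. Gaussian basis vectors u_i, the direction fields
# h_i = w.u_i / sqrt(N) are approximately standard normal, so ||h_vec|| concentrates
# around sqrt(D) and the ball constraint h0 - R*||h_vec|| >= kappa behaves like a
# point constraint with effective margin kappa + R*sqrt(D).
import numpy as np

rng = np.random.default_rng(2)
N = 2000
w = rng.standard_normal(N); w *= np.sqrt(N) / np.linalg.norm(w)
for D in [10, 100, 1000]:
    U = rng.standard_normal((D, N))
    h_norm = np.linalg.norm(U @ w) / np.sqrt(N)
    print(f"D = {D:5d}:  ||h_vec|| = {h_norm:7.2f},  sqrt(D) = {np.sqrt(D):7.2f}")
```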
FIG. 3: $L_1$ balls: (a) Linear classification of 2-D $L_1$ balls. (b) Fraction of manifold configurations as a function of radius $R$ at capacity with $\kappa = 0$, comparing theory (lines) to simulations (markers): (red) entire manifold embedded, (blue) manifold touching the margin at a single vertex, (gray) manifold touching with two corners (one side), (purple) interior manifold.
When $D$ is large, making $w$ orthogonal to a significant fraction of the high-dimensional manifolds incurs a prohibitive loss in the effective dimensionality. Hence, in the limit of large $D$, the fraction of manifolds that lie in the margin planes is zero. Interestingly, when $R$ is sufficiently large, $R \propto \sqrt{D}$, it becomes advantageous for $w$ to be orthogonal to a finite fraction of the manifolds.

$L_p$ balls: To study the effect of changing the geometrical shape of the manifolds, we replace the Euclidean norm constraint on the manifold boundary by a constraint on their $L_p$ norm. Specifically, we consider $D$-dimensional manifolds $\{x^\mu + R\sum_{i=1}^{D} s_i u_i^\mu\}$, where the $D$-dimensional vector $\vec{s}$ parameterizing points on the manifolds is constrained by $\|\vec{s}\|_p \le 1$. For $1 < p < \infty$, these $L_p$ manifolds are smooth and convex. Their linear classification by a vector $w$ is determined by the field constraints $h_0^\mu - R\|\vec{h}^\mu\|_q \ge \kappa$ where, as before, $h_0^\mu$ are the fields induced by the centers, and $\|\vec{h}^\mu\|_q$, with $q = p/(p-1)$, is the $L_q$ dual norm of the $D$-dimensional field vector induced by the $u_i^\mu$ (SM). The resultant solutions are qualitatively similar to what we observed with $L_2$ ball manifolds.
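The dual-norm constraint follows from Hölder's inequality: the most offending point on an $L_p$ ball is where $\vec{h}\cdot\vec{s} = -\|\vec{h}\|_q$. A small numerical sketch (ours; it uses a generic constrained optimizer, not anything from the paper) verifies this for a smooth case $1 < p < \infty$.

```python
# A small sketch (ours) verifying the dual-norm field constraint for L_p ball manifolds:
# the smallest value of h_vec . s over ||s||_p <= 1 is -||h_vec||_q with q = p/(p-1)
# (Hoelder duality), so the worst-case field is h0 - R * ||h_vec||_q.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(3)
D, p = 4, 3.0
q = p / (p - 1.0)
h_vec = rng.standard_normal(D)

# numerically minimize h_vec . s subject to ||s||_p <= 1
res = minimize(lambda s: h_vec @ s, x0=0.1 * np.ones(D), method="SLSQP",
               constraints=[{"type": "ineq",
                             "fun": lambda s: 1.0 - np.sum(np.abs(s)**p)}])
print(res.fun, -np.linalg.norm(h_vec, ord=q))   # the two values should agree
```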
However, when $p \le 1$, the convex hull of the manifold becomes faceted, consisting of vertices, flat edges, and faces. For these geometries, the constraint on the fields associated with a solution vector $w$ becomes $h_0^\mu - R\max_i |h_i^\mu| \ge \kappa$ for all $p \le 1$. We have solved in detail the case of $D = 2$ (SM). There are four manifold classes: interior; touching the margin plane at a single vertex point; a flat side embedded in the margin; and fully embedded. The fractions of these classes are shown in Fig. 3.

Discussion: We have extended Gardner's theory of the linear classification of isolated points to the classification of continuous manifolds with simple geometric shapes. We showed how to use replica theory to compute the perceptron capacity and field distributions as a function of the task margin and manifold geometry. Our analysis shows how the dimensionality and size of the manifolds can profoundly affect the overall capacity and determine the relationship between the manifolds and the separating margin planes. Our theory shows that linear separability of manifolds depends intimately upon the geometry of their convex hulls. For this reason, only the statistics of the endpoints of line segments for $D = 1$ are relevant; similarly, $L_p$ balls with $p < 1$ have the same capacity as $p = 1$ balls with the same dimensionality and radius.

We focused here on the classification of fully observed manifolds and have not addressed the problem of generalization from finite input sampling of the manifolds. Nevertheless, our results about the properties of maximum margin solutions can be readily utilized to estimate generalization from finite samples. The current theory can be extended in several important ways to consider more complex models for invariant perception of objects. Additional geometric features can be incorporated, such as non-uniform radii for the manifolds as well as heterogeneous mixtures of manifolds. The influence of correlations in the structure of the manifolds as well as the effect of sparse labels can also be considered. We anticipate that the present work will lay the groundwork for a computational theory of neuronal processing of perceptual manifolds and provide quantitative measures for assessing the properties of empirical object representations in biological and artificial neural networks.

Helpful discussions with Remi Monasson and Uri Cohen are acknowledged. The work is partially supported by the Gatsby Charitable Foundation, the Swartz Foundation, the Simons Foundation (SCGB Grant No. 325207), the NIH, and the Human Frontier Science Program (Project RGP0015/2013). D. Lee also acknowledges the support of the US National Science Foundation, Army Research Laboratory, Office of Naval Research, Air Force Office of Scientific Research, and Department of Transportation.
∗ Correspondence ([email protected])

[1] M. L. Minsky and S. A. Papert, Perceptrons - Expanded Edition: An Introduction to Computational Geometry (MIT Press, Cambridge, MA, 1987).
[2] E. Gardner, Europhysics Letters 4, 481 (1987).
[3] V. Vapnik, Statistical Learning Theory, Vol. 1 (Wiley, New York, 1998).
[4] E. Gardner, Journal of Physics A: Mathematical and General 21, 257 (1988).
[5] A. Engel and C. Van den Broeck, Statistical Mechanics of Learning (Cambridge University Press, 2001).
[6] M. Advani, S. Lahiri, and S. Ganguli, Journal of Statistical Mechanics: Theory and Experiment 2013, P03014 (2013).
[7] J. J. DiCarlo and D. D. Cox, Trends in Cognitive Sciences 11, 333 (2007).
[8] M. Pagan, L. S. Urban, M. P. Wohl, and N. C. Rust, Nature Neuroscience 16, 1132 (2013).
[9] C. P. Hung, G. Kreiman, T. Poggio, and J. J. DiCarlo, Science 310, 863 (2005).
[10] W. A. Freiwald and D. Y. Tsao, Science 330, 845 (2010).
[11] T. Serre, L. Wolf, and T. Poggio, in IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (IEEE, 2005), Vol. 2, pp. 994–1000.
[12] I. Goodfellow, H. Lee, Q. V. Le, A. Saxe, and A. Y. Ng, in Advances in Neural Information Processing Systems (2009), pp. 646–654.
[13] M. A. Ranzato, F. J. Huang, Y.-L. Boureau, and Y. LeCun, in IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (IEEE, 2007), pp. 1–8.
[14] Y. Bengio, Foundations and Trends in Machine Learning 2, 1 (2009).
[15] L. F. Abbott and T. B. Kepler, Journal of Physics A: Mathematical and General 22, 2031 (1989).