Learning Boolean functions with concentrated spectra

Dustin G. Mixon and Jesse Peterson

Department of Mathematics and Statistics, Air Force Institute of Technology, Wright-Patterson Air Force Base, Ohio 45433, USA
ABSTRACT

This paper discusses the theory and application of learning Boolean functions that are concentrated in the Fourier domain. We first estimate the VC dimension of this function class in order to establish a small sample complexity of learning in this case. Next, we propose a computationally efficient method of empirical risk minimization, and we apply this method to the MNIST database of handwritten digits. These results demonstrate the effectiveness of our model for modern classification tasks. We conclude with a list of open problems for future investigation.

Keywords: classification, Boolean functions, sparsity, MNIST
1. INTRODUCTION

The last year has produced several breakthroughs in classification. Deep neural networks now match human-level performance in both facial recognition [19] and general image recognition [10], and they also outperform both Apple's and Google's proprietary speech recognizers [8]. These advances leave one wondering about their implications for the general field of machine learning. Indeed, Google and Facebook have actively acquired top talent in deep neural networks to pursue these leads [6, 11]. As a parallel pursuit, many have sought theoretical justification for the unreasonable effectiveness of deep neural networks [3-5, 17]. Unfortunately, the theory is still underdeveloped, as we currently lack a theoretical grasp of the computational complexity of learning with modern deep neural networks. Answers to such fundamental questions will help illustrate the scope of these emerging capabilities.

Observe that neural networks resemble circuit implementations of Boolean functions. Indeed, a circuit amounts to a directed acyclic graph with n input nodes and a single output node, along with intermediate nodes that represent Boolean logic gates (such as ANDs, ORs, threshold gates, etc.). As such, one may view circuits as discrete analogs of neural networks. To date, there is quite a bit of theory behind the learnability of Boolean functions with sufficiently simple circuit implementations [12, 16, 18]. The main idea is that such functions enjoy a highly concentrated Fourier transform due to a clever application of Hastad's Switching Lemma [9]. Passing through the analogy, one might then hypothesize that the real-world functions that are well approximated by learnable deep neural networks also enjoy a highly concentrated Fourier transform; this hypothesis motivates our approach.

This paper discusses the theory and application of learning Boolean functions that are concentrated in the Fourier domain. The following section provides some background material on statistical learning theory to help set the stage for our investigation. We then prove in Section 3 that the sample complexity of learning Boolean functions with concentrated spectra is small. Section 4 proposes a learning algorithm as a first step towards tackling the computational complexity, and Section 5 illustrates how well our model performs on a real-world dataset (namely, the MNIST database of handwritten digits [14]). We conclude in Section 6 with a list of open problems.
2. BACKGROUND

The objective is to estimate an unknown labeling function $f\colon\{\pm1\}^n\to\{\pm1\}$. A sample $\{x_i\}_{i=1}^{\ell}\subseteq\{\pm1\}^n$ is drawn i.i.d. according to some unknown distribution $p$, and we receive the labeled training set $\{(x_i, f(x_i))\}_{i=1}^{\ell}$. The quality of our estimate $\hat f\colon\{\pm1\}^n\to\{\pm1\}$ will be evaluated in terms of the risk functional
$$R(\hat f, f) := \sum_{x\in\{\pm1\}^n} 1_{\{\hat f(x)\neq f(x)\}}\, p(x).$$
Risk is commonly approximated by empirical risk using a random sample $\{y_i\}_{i=1}^{m}$ drawn i.i.d. according to $p$:
$$R_y(\hat f, f) := \frac{1}{m}\sum_{i=1}^{m} 1_{\{\hat f(y_i)\neq f(y_i)\}}.$$
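For concreteness, empirical risk is just the misclassification rate over the sample; a minimal Python sketch (the function and variable names here are ours, not the paper's):

```python
def empirical_risk(f_hat, f, sample):
    """Fraction of sample points on which the estimate f_hat disagrees with f.

    f_hat, f : callables mapping a length-n tuple with entries in {+1,-1} to +1 or -1
    sample   : list of such tuples, drawn i.i.d. from the (unknown) distribution p
    """
    return sum(f_hat(y) != f(y) for y in sample) / len(sample)
```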
In practice, empirical risk is used to evaluate an estimate $\hat f$ with the help of a labeled test set (which is disjoint from the training set). Similarly, it makes sense to pick $\hat f$ so as to minimize the empirical risk $R_x$ over the training set. However, the training sample only covers a small fraction of the sample space $\{\pm1\}^n$, so how do we decide which values $\hat f$ should take beyond this sample? The trick is to restrict our empirical risk minimization to a "simple" function class $C\subseteq\{g\colon\{\pm1\}^n\to\{\pm1\}\}$. Intuitively, if we suspect that $f$ is close to some member $\hat f$ of a small set $C$, and a large training set happens to nearly match one such member, then chances are small that this occurred by mere coincidence, and so we should expect $R(\hat f, f)$ to be small. In general, $C$ doesn't need to be small, as it suffices for $C$ to enjoy a broader notion of simplicity:

Definition 1. The function class $C\subseteq\{g\colon\{\pm1\}^n\to\{\pm1\}\}$ is said to shatter $\{x_i\}_{i=1}^{\ell}\subseteq\{\pm1\}^n$ if for every choice of labels $y_1,\ldots,y_\ell\in\{\pm1\}$, there exists a function $g\in C$ such that $g(x_i)=y_i$. The VC dimension of $C$ is the size of the largest set that $C$ shatters.

We consider $C$ to be simple if its VC dimension is small. By counting, it is clear that the VC dimension of $C$ is at most $\log_2|C|$, meaning small sets are necessarily simple. The following result illustrates the utility of VC dimension as a notion of simplicity:

Theorem 2 (obtained by combining equations (3.15) and (3.23) from Vapnik [21]). Fix $C\subseteq\{g\colon\{\pm1\}^n\to\{\pm1\}\}$ and let $h$ denote its VC dimension. Pick $f\colon\{\pm1\}^n\to\{\pm1\}$ and draw $\{x_i\}_{i=1}^{\ell}\subseteq\{\pm1\}^n$ i.i.d. according to some (unknown) distribution $p$. Then with probability $\ge 1-\eta$,
$$R(g,f) \;\le\; R_x(g,f) + \sqrt{\frac{h(\log(2\ell/h)+1)-\log(\eta/4)}{\ell}}$$
for all $g\in C$ simultaneously. Here, the probability is on $\{x_i\}_{i=1}^{\ell}$.

In words, Theorem 2 states that the risk of an estimate is small provided its empirical risk is small over a sufficiently large training set (namely, $\ell\gg h$). This suggests three properties that we want our function class $C$ to satisfy:

• Simple. We want the VC dimension of $C$ to be small, so as to allow for a small sample complexity.

• Admits fast optimization. We want empirical risk minimization over $C$ to be computationally efficient.

• Models reality. Given an application, we want the true function to be close to some member of $C$.

In the remainder of this paper, we study whether Boolean functions with concentrated spectra form a function class that satisfies these desiderata. To be explicit, the following defines the function class of interest:

Definition 3. Let $C_{n,k}$ denote the class of all functions $g\colon\{\pm1\}^n\to\{\pm1\}$ for which there exist index sets $S_1,\ldots,S_k\subseteq[n]$ and coefficients $a_1,\ldots,a_k\in\mathbb{R}$ such that
$$g(x) = \operatorname{sign}\!\Big(\sum_{i=1}^{k} a_i \prod_{j\in S_i} x_j\Big) \quad \forall x\in\{\pm1\}^n. \tag{1}$$
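To illustrate Definition 3, a member of $C_{n,k}$ is specified by $k$ index sets and $k$ real coefficients, and is evaluated by thresholding a sum of signed monomials. A minimal Python sketch with illustrative data of our own choosing (the sign convention at zero is arbitrary):

```python
import numpy as np

def evaluate_member(x, index_sets, coeffs):
    """Evaluate g(x) = sign(sum_i a_i * prod_{j in S_i} x_j) for x in {+1,-1}^n."""
    total = sum(a * np.prod(x[list(S)]) for S, a in zip(index_sets, coeffs))
    return 1 if total >= 0 else -1  # break ties at zero arbitrarily

# Hypothetical example with n = 4 and k = 2 terms (index sets are 0-indexed here).
x = np.array([1, -1, -1, 1])
index_sets = [[0, 2], [1, 3]]   # S_1, S_2
coeffs = [0.7, -1.2]            # a_1, a_2
print(evaluate_member(x, index_sets, coeffs))   # prints 1
```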
In the following section, we estimate $C_{n,k}$'s VC dimension. Next, Section 4 proposes a method for performing empirical risk minimization over $C_{n,k}$. Finally, we apply this method to the MNIST database of handwritten digits [14] in Section 5 to illustrate how well $C_{n,k}$ models reality (at least in one application).
3. ESTIMATING THE VC DIMENSION

Considering Theorem 2, we desire a function class of small VC dimension, as this will allow us to get away with a small training set. In this section, we estimate the VC dimension of $C_{n,k}$, the class of functions of the form (1). To this end, it is helpful to identify how (1) is related to the Walsh–Hadamard transform $W\colon\ell_2(\mathbb{Z}_2^n)\to\ell_2(\mathbb{Z}_2^n)$, defined by
$$(Wz)(v) := \sum_{u\in\mathbb{Z}_2^n} z(u)\,(-1)^{\sum_{j=1}^{n} u_j v_j} \quad \forall v\in\mathbb{Z}_2^n.$$
Taking $S_u := \{j : u_j = 1\}$ and $x_j := (-1)^{v_j}\in\{\pm1\}$, we equivalently have
$$(Wz)(\log^{-1}(x)) = \sum_{u\in\mathbb{Z}_2^n} z(u)\prod_{j\in S_u} x_j \quad \forall x\in\{\pm1\}^n.$$
Note that every real polynomial over $\{\pm1\}^n$ is of the form $(Wz)\circ\log^{-1}$. When the coefficients are $k$-sparse, taking the sign of this polynomial produces a member of $C_{n,k}$. Recall the matrix representation of $W$, namely
$$W = \begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix}^{\otimes n}.$$
The vector $Wz$ lists all $2^n$ possible outputs of $(Wz)\circ\log^{-1}$. As such, we may identify $C_{n,k}$ with
$$\{\operatorname{sign}(Wz) : \|z\|_0 \le k\}. \tag{2}$$
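As a concrete (if brute-force) illustration of this identification, one can build $W$ from its $\otimes n$ structure and realize a member of $C_{n,k}$ by applying sign to $Wz$ for a $k$-sparse $z$; for large $n$, a fast Walsh–Hadamard transform would replace the explicit matrix. This sketch is ours, with illustrative parameters:

```python
import numpy as np

def walsh_hadamard_matrix(n):
    """Return the 2^n x 2^n matrix [[1, 1], [1, -1]]^(tensor n)."""
    H2 = np.array([[1, 1], [1, -1]])
    W = np.array([[1]])
    for _ in range(n):
        W = np.kron(W, H2)
    return W

# A k-sparse coefficient vector z determines the member sign(Wz) of C_{n,k}.
n, k = 4, 3
rng = np.random.default_rng(0)
z = np.zeros(2 ** n)
support = rng.choice(2 ** n, size=k, replace=False)
z[support] = rng.standard_normal(k)   # Gaussian coefficients avoid exact zeros in Wz
g = np.sign(walsh_hadamard_matrix(n) @ z)   # all 2^n outputs of the Boolean function
```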
We use this identification to prove the main result of this section:

Theorem 4. The VC dimension of $C_{n,k}$ is at most $2nk - O(k\log k)$.

Proof. Since the VC dimension is at most $\log_2|C_{n,k}|$, it suffices to estimate $|C_{n,k}|$. Considering (2), this quantity is the number of orthants in $\mathbb{R}^{2^n}$ that intersect the union of subspaces $\{Wz : \|z\|_0\le k\}$. By Theorem V.1 of van den Berg and Friedlander [20], each subspace intersects at most $2\binom{2^n-1}{\le k-1}$ orthants, where $\binom{m}{\le j}$ denotes $\sum_{i=0}^{j}\binom{m}{i}$, and so
$$|C_{n,k}| \;\le\; \binom{2^n}{k}\cdot 2\binom{2^n-1}{\le k-1} \;\le\; \binom{2^n}{k}\cdot 2\binom{2^n-1+k-1}{k-1} \;\le\; \binom{2^{n+1}}{k}^{2}.$$
Taking logs of both sides and applying the bound $\binom{a}{b}\le\left(\frac{ea}{b}\right)^{b}$ then gives the result.
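Spelling out that final step, under the bound displayed above:
$$\log_2|C_{n,k}| \;\le\; 2\log_2\binom{2^{n+1}}{k} \;\le\; 2k\log_2\!\frac{e\,2^{n+1}}{k} \;=\; 2nk + 2k(1+\log_2 e) - 2k\log_2 k \;=\; 2nk - O(k\log k).$$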
As a complementary result, the following illustrates how tight our bound is:

Proposition 5. The VC dimension of $C_{n,k}$ is at least $\max\{n,k\}$.

Proof. It suffices to find a subcollection $S$ of $\max\{n,k\}$ row indices such that $\{\operatorname{sign}(W_S z) : \|z\|_0\le k\} = \{\pm1\}^{|S|}$, where $W_S$ denotes the $|S|\times 2^n$ submatrix of rows from $W$ indexed by $S$. In the case where $n\ge k$, let $S$ index the rows of $W$ which have the form
$$w_i := (1,1)^{\otimes(i-1)}\otimes(1,-1)\otimes(1,1)^{\otimes(n-i)}$$
for some $i\in[n]$. Note that for every $u\in\mathbb{Z}_2^n$, the corresponding identity basis element is given by $e_u := \bigotimes_{i=1}^{n} e_{u_i}$, where $e_0 := (1,0)$ and $e_1 := (0,1)$. This then implies
$$(W_S e_u)_i = \langle e_u, w_i\rangle = \langle e_{u_i}, (1,-1)\rangle = (-1)^{u_i}.$$
As such, every member of $\{\pm1\}^n$ can be expressed as $W_S e_u$ for some identity basis element $e_u$. In the remaining case where $k\ge n$, observe that $W$ has rank $2^n\ge k$, and so there exists a $k\times k$ submatrix $A$ of rank $k$. Let $S$ denote the row indices of $A$. Then every vector in $\{\pm1\}^k$ can be expressed as $W_S z$, where $z$ is supported on the column indices of $A$.
In pursuit of better lower bounds, we need more techniques to tackle the notion of shattering. The proof of the following proposition provides some ideas along these lines:

Proposition 6. Given an $N\times N$ matrix $A$, denote $C_{A,k} := \{\operatorname{sign}(Az) : \|z\|_0\le k\}$.

(a) $C_{A,k}$ fails to shatter a set of size $O(k^2)$ if $A$ is the Walsh–Hadamard transform matrix and $k\lesssim\sqrt{N}$.

(b) $C_{A,k}$ fails to shatter a set of size $O(k\log(N/k))$ w.h.p. if the entries of $A$ are i.i.d. $N(0,1)$.

Proof. (a) Define the sign rank of a matrix $S$ with entries in $\{\pm1\}$ to be the minimum rank of all matrices $M$ satisfying $M_{ij}S_{ij} > 0$ for every $i,j$. Write $N = 2^n$, and observe that each column of $A$ can be reshaped into a $2^{\lfloor n/2\rfloor}\times 2^{\lceil n/2\rceil}$ matrix of rank 1. Then reshaping $\operatorname{sign}(Az)$ in the same way produces a $2^{\lfloor n/2\rfloor}\times 2^{\lceil n/2\rceil}$ matrix of sign rank at most $k$. By the left-hand inequality of equation (1) from Alon, Frankl and Rödl [1], for every $m$, there exists an $m\times m$ matrix $S$ with entries in $\{\pm1\}$ of sign rank at least $m/32$. Taking $m := 32(k+1)\le 2^{\lfloor n/2\rfloor}$, use the corresponding matrix $S$ to construct a $2^{\lfloor n/2\rfloor}\times 2^{\lceil n/2\rceil}$ matrix $S'$ by padding with $\pm1$s. Then $S'$ has sign rank at least $k+1$ (implying $S'\notin C_{A,k}$) regardless of how the padded entries are selected. As such, $C_{A,k}$ fails to shatter these $m^2$ entries.

(b) We seek $m$ such that for every $f\in C_{A,k}$, the first $m$ entries of $f$ are not all 1s. Let $A_{I,J}$ denote the submatrix with row indices in $I$ and column indices in $J$. Then equivalently, we seek $m$ such that for every $K\subseteq[N]$ with $|K| = k$, the subspace $\operatorname{im}(A_{[m],K})$ intersects the nonnegative orthant $\mathbb{R}^m_{\ge0}$ uniquely at the origin. Since $\operatorname{im}(A_{[m],K})$ is drawn uniformly from the Grassmannian of $k$-dimensional subspaces in $\mathbb{R}^m$, we may apply Gordon's Escape Through a Mesh Theorem (namely, Corollary 3.4 of Gordon [7]). For this theorem, we use the fact that the Gaussian width of the positive orthant is $\sqrt{m/2} - O(1/\sqrt{m})$, as established by Propositions 3.2 and 10.2 in Amelunxen et al. [2]. Then
$$\Pr\Big[\operatorname{im}(A_{[m],K})\cap\mathbb{R}^m_{\ge0}=\{0\}\Big] \;\ge\; 1-\frac{7}{2}\exp\!\left(-\frac{1}{18}\left(\sqrt{m-k}-\frac{1}{\sqrt{m-k}}-\sqrt{\frac{m}{2}}+O\!\Big(\frac{1}{\sqrt{m}}\Big)\right)^{\!2}\right).$$
Taking $m = ck\log(N/k)$ for sufficiently large $c$, the union bound then gives
$$\Pr\Big[\operatorname{im}(A_{[m],K})\cap\mathbb{R}^m_{\ge0}=\{0\}\ \ \forall K\subseteq[N],\ |K|=k\Big] \;\ge\; 1 - e^{-\Omega(k\log(N/k))}.$$
As such, $C_{A,k}$ fails to shatter the first $m$ entries.
4. EMPIRICAL RISK MINIMIZATION

In this section, we consider the problem of empirical risk minimization over $C_{n,k}$:

Problem 7. Let $W$ denote the $2^n\times 2^n$ Walsh–Hadamard transform matrix, and let $z$ be some (nearly) $k$-sparse vector in $\mathbb{R}^{2^n}$. Given a sample $x$ of $\ell$ entries of $f = \operatorname{sign}(Wz)$, find $\hat f\in C_{n,k}$ satisfying
$$R_x(\hat f, f) \;\le\; \mathrm{const}\cdot\min_{g\in C_{n,k}} R_x(g, f) \tag{3}$$
in $\mathrm{poly}(n,k,\ell)$ time.

This problem can be viewed as a combination of the sparse fast Walsh–Hadamard transform [15] and one-bit compressed sensing [13]. Computationally, the main difficulty seems to be achieving polynomial time in $n$ rather than $2^n$, and to do so, one must somehow take advantage of the rich structure provided by the Walsh–Hadamard transform. As a cheap alternative, this paper instead simplifies the problem by strengthening the assumptions, namely, that $z$ is mostly supported on members of $\mathbb{Z}_2^n$ with Hamming weight at most $d$, that is, the polynomial $(Wz)\circ\log^{-1}$ has degree at most $d$. This allows us to discard the vast majority of columns of $W$, leaving only $O(n^d)$ columns to consider. Let $W_{x,d}$ denote the submatrix of $W$ with row indices in the sample $x$ and column indices of Hamming weight at most $d$. Also, let $f_x$ denote the true function $f$ restricted to the sample $x$.
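As a rough illustration of how $W_{x,d}$ might be assembled in practice, one may enumerate all index sets of size at most $d$ and record the corresponding parities of each sample point. The sketch below uses our own naming, not the authors' code, and returns the index sets alongside the matrix so that columns can later be mapped back to monomials.

```python
import numpy as np
from itertools import combinations

def low_degree_parity_matrix(X, d):
    """Rows of W indexed by the sample X (shape (l, n), entries in {+1,-1}),
    columns restricted to frequencies of Hamming weight <= d.
    The number of columns is sum_{r <= d} C(n, r) = O(n^d)."""
    l, n = X.shape
    index_sets = [S for r in range(d + 1) for S in combinations(range(n), r)]
    columns = [np.prod(X[:, list(S)], axis=1) for S in index_sets]  # parity features
    return np.column_stack(columns), index_sets
```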
To isolate an estimate $\hat f\in C_{n,k}$, we first perform feature selection by finding the columns of $W_{x,d}$ which look the most like $f_x$. That is, we find the largest entries of $|W_{x,d}^{\top} f_x|$ and isolate the corresponding columns of $W_{x,d}$. Thanks to the discrete nature of $W$, some of these columns may be identical up to a global sign factor. As such, we collect columns of $W_{x,d}$ corresponding to the largest entries of $|W_{x,d}^{\top} f_x|$ until we have $k$ distinct columns up to sign. Let $A$ denote the resulting $\ell\times k$ matrix of columns. Then it remains to find a coefficient vector $z$ such that $f_x\approx\operatorname{sign}(Az)$, that is, to train a support vector machine. After finding $z$, we pick $\hat f\in C_{n,k}$ according to (1) by taking the coefficients $\{a_i\}_{i=1}^{k}$ to be the entries of $z$ and taking the index sets to be $S_i = \{j : u_j = 1\}$, where $u\in\mathbb{Z}_2^n$ is the column index of $W$ corresponding to the $i$th column of $A$. We note that alternatively, one could train an $\ell_1$-restricted support vector machine [22] to find a sparse $z$ such that $f_x\approx\operatorname{sign}(W_{x,d}z)$:
$$\min_z\ \sum_{i=1}^{\ell}\big[1-(f_x)_i(W_{x,d}z)_i\big]_+ \quad\text{s.t.}\quad \|z\|_1\le\tau. \tag{4}$$
However, we found this to be slow in practice, even for small values of $d$. Still, we followed the intent of this method by applying a loose $\ell_1$ restriction to our support vector machine training, albeit after we performed feature selection. We note that for a fixed $d$, our two-step method (feature selection, then support vector machine training) runs in time which is polynomial in $n$, $k$ and $\ell$. Unfortunately, we currently lack a performance guarantee of the form (3). Instead, we apply this method to real-world data in the following section to illustrate its effectiveness (as well as the quality of the function model $C_{n,k}$).
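The following sketch puts the two steps together under the assumptions above. We use scikit-learn's LinearSVC as a stand-in for the $\ell_1$-restricted support vector machine, so the training step is only an approximation of program (4); all names are ours.

```python
import numpy as np
from sklearn.svm import LinearSVC  # stand-in for the l1-restricted SVM of (4)

def fit_concentrated_classifier(W_xd, index_sets, f_x, k):
    """Two-step method: keep the k columns of W_xd (distinct up to sign) that
    correlate most with the labels f_x, then fit a linear classifier on them."""
    scores = np.abs(W_xd.T @ f_x)             # |W_{x,d}^T f_x|
    chosen, seen = [], set()
    for j in np.argsort(scores)[::-1]:
        key = tuple(W_xd[:, j] * W_xd[0, j])  # entries are +/-1, so this fixes the sign
        if key not in seen:
            seen.add(key)
            chosen.append(j)
        if len(chosen) == k:
            break
    A = W_xd[:, chosen]                       # the l x k matrix of selected parities
    svm = LinearSVC().fit(A, f_x)             # hinge-type loss, no explicit l1 cap
    return svm.coef_.ravel(), [index_sets[j] for j in chosen]
```

The returned coefficients and index sets would then define $\hat f$ via (1).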
5. IMPLEMENTATION WITH HANDWRITTEN DIGITS

The MNIST database of handwritten digits [14] contains 5923 zeros and 6742 ones, and each digit image is represented by a $28\times28$ matrix with entries in $[0,1]$. To keep the runtime reasonable, we reduced each image to a $5\times5$ matrix by convolving with the indicator function of a $5\times5$ block and sampling over a $5\times5$ grid. We then thresholded the entries to obtain vectors in $\{\pm1\}^{25}$; typical results of this process are illustrated in Figure 1. A $-1$ label was assigned to the zeros, and the ones were similarly labeled with a $1$. At this point, classification amounts to learning a function $f\colon\{\pm1\}^{25}\to\{\pm1\}$.

After processing the data in this way, we implemented our method of feature selection and support vector machine training as detailed in Section 4. We assumed the polynomial $f = (Wz)\circ\log^{-1}$ has degree at most $d$, and we fixed $d = 3$. We chose this value because we found that increasing $d$ greatly increases runtime without empirically improving the classifier. In order to choose a sparsity level $k$, we performed feature selection and trained a support vector machine for multiple choices of $k$. Intuitively, taking $k$ too small will overly simplify the model and fail to match the inherent complexities of the data, while large choices of $k$ will lead to overfitting.

Before picking $k$, we first let this parameter range from 10 to 280 in increments of 10, and for each value of $k$, we performed 10 experiments in which we chose disjoint training and test sets of sizes 1500 and 2500, respectively, from each label class. These sets were chosen uniformly at random without replacement. We made the training set small relative to the entire database due to the long runtime required to train a support vector machine; we used a much larger training set after we identified the "optimal" $k$. Given a training sample, we performed feature selection as described in Section 4 and trained a support vector machine as in (4) with a loose $\ell_1$ restriction ($\tau = 1000$). We then calculated the empirical risk using the test set and plotted the results in the left portion of Figure 2. This figure shows the sparsity level $k$ versus the mean empirical risk, with error bars denoting one standard deviation from the mean. The large error bars for $k = 50$ are due to a single extreme outlier. Observe that the empirical risk decreases and plateaus at around $k = 150$, and while we would expect the curve to trend upward for larger $k$ due to overfitting, we terminated our computations at $k = 180$ due to computation time.

For each trial of our experiment, we also recorded the misclassification rate in the training set. The difference between the test and training misclassification rates forms a proxy for $R(\hat f, f) - R_x(\hat f, f)$.
Figure 1. Sample of the MNIST database of handwritten digits [14]. (top row) Each image is a $28\times28$ matrix with entries in $[0,1]$. (bottom row) In order to minimize the runtime of learning, we decided to convert each image into a $5\times5$ image by convolving with the indicator function of a $5\times5$ block and sampling over a $5\times5$ grid. Finally, we thresholded the entries to produce a vector in $\{\pm1\}^{25}$. Identifying the label zero with $-1$ and one with $1$, the classification task amounts to learning a function $f\colon\{\pm1\}^{25}\to\{\pm1\}$, and so we apply the method described in Section 4.
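To make the preprocessing of Figure 1 concrete, here is one way it might be carried out; the integral-image trick, the even spacing of the sampling grid, and the 0.5 threshold are our illustrative choices rather than details taken from the paper.

```python
import numpy as np

def downsample_and_binarize(image, out_size=5, block=5, threshold=0.5):
    """Convolve a [0,1]-valued square image with the indicator of a block x block
    window, sample on an out_size x out_size grid, and threshold to {+1,-1}."""
    h, w = image.shape
    # integral image: C[i, j] = sum of image[:i, :j]
    C = np.pad(image, ((1, 0), (1, 0))).cumsum(axis=0).cumsum(axis=1)
    sums = (C[block:, block:] - C[:-block, block:]
            - C[block:, :-block] + C[:-block, :-block])   # all block x block sums
    idx = np.linspace(0, sums.shape[0] - 1, out_size).round().astype(int)
    averages = sums[np.ix_(idx, idx)] / block ** 2
    return np.where(averages > threshold, 1, -1).ravel()  # vector in {+1,-1}^25
```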
Figure 2. (left) For each $k = 10, 20, \ldots, 280$, we ran 10 trials of the following experiment: Draw a random training set of 1500 zeros and 1500 ones, train a classifier with $d = 3$ and parameter $k$ according to Section 4, draw a random test set of 2500 zeros and 2500 ones (disjoint from the training set), and record the misclassification rate. Here, we plot the average rate along with error bars indicating one standard deviation above and below. For $k = 50$, the error bars are large due to a single outlier. The misclassification rate plateaus after around $k = 150$, and so to minimize computational complexity, we selected $k = 150$ for our final classifier. For larger values of $k$, we expect the misclassification rate to increase due to overfitting, but long runtimes prevented us from performing such experiments. (right) For each trial of our experiment, we also recorded the misclassification rate in the training set. The difference between the test and training misclassification rates forms a proxy for $R(\hat f, f) - R_x(\hat f, f)$. Qualitatively, the plot of these differences matches the behavior predicted by the square-root term in Theorem 2, but the values in the plot are orders of magnitude smaller. This suggests that the guarantee provided by the theorem is conservative in our setting.
Figure 3. All 28 digits that were misclassified by our classifier. Can you guess which are which? We suspect the classifier would perform better if it received the original images instead of downsampled versions, but this would be computationally costly. Indeed, we believe new algorithms need to be developed before our techniques can compete with the state of the art. Spoiler: The last 7 digits are ones, and the rest are zeros.
Qualitatively, the plot of these differences matches the behavior predicted by the square-root term in Theorem 2, but the values in the plot are orders of magnitude smaller. This suggests that the guarantee provided by the theorem is conservative in our setting.

Since the empirical risk fails to noticeably decrease after about $k = 150$, we selected this value for $k$ and performed our feature selection and training process on a large training set. Specifically, we randomly chose a training set consisting of 4000 zeros and 4000 ones (approximately two thirds of the entire database). Our test set then consisted of 1900 zeros and 1900 ones. With these training and test sets, our choice of $k$ achieved a misclassification rate of 0.74% after a total runtime of 160 seconds. Considering we greatly downsampled our data from the MNIST database and we only attempted to classify zeros and ones, there is no direct comparison to be made with existing results [14]. Still, it is worth mentioning that the best SVM classifier exhibits a misclassification rate (on all digits 0 through 9) of 0.56%, suggesting that our results are reasonable. To make this point stronger, Figure 3 displays all 28 misclassified digits from the test set. We contend that a human would likely misclassify these digits as well. Can you pick out the zeros from the ones?
6. CONCLUSION AND OPEN PROBLEMS

This paper demonstrates the plausibility of learning Boolean functions with concentrated spectra, as well as its applicability to modern classification theory and application. However, this paper offers more questions than answers. For example, while we showed in Section 3 that the VC dimension $h$ of $C_{n,k}$ satisfies $\max\{n,k\}\le h\le 2nk - O(k\log k)$, we have yet to identify how $h$ scales with $n$ and $k$. This leads to our first open problem:

Problem 8. Determine the VC dimension of $C_{n,k}$.

In Section 4, we proposed an algorithm for empirical risk minimization over polynomials in $C_{n,k}$ of degree at most $d$. However, we still don't know if this algorithm produces an estimate $\hat f$ that satisfies a guarantee of the form (3).

Problem 9. Find a performance guarantee for the algorithm proposed in Section 4.

Problem 7 also remains open, but before it can be solved, one must first devise a candidate algorithm. This leads to the following intermediate problem:

Problem 10. Let $W$ denote the $2^n\times 2^n$ Walsh–Hadamard transform matrix, and let $z$ be some (nearly) $k$-sparse vector in $\mathbb{R}^{2^n}$. Given a sample $x$ of $\ell$ entries of $f = \operatorname{sign}(Wz)$, find $\hat f\in C_{n,k}$ that well approximates $f$ (at least empirically) in $\mathrm{poly}(n,k,\ell)$ time.

This is perhaps the most important open problem in this paper. Considering our implementation in Section 5, the algorithm proposed in Section 4 exhibits certain computational bottlenecks due to its poor dependence on $d$. As such, the methods of this paper might fail to compete with state-of-the-art classification until we find a solution to Problem 10.

Finally, while Section 5 demonstrated the effectiveness of our model for handwritten digits, we have yet to determine the full scope of its applicability. This suggests the need for more numerical experiments, but there is also a theoretical result to seek. Indeed, our approach was motivated by a certain hypothesis, and the confirmation of this hypothesis remains an open problem:

Problem 11. Prove that Boolean functions that are well approximated by learnable deep neural networks also enjoy a highly concentrated Fourier transform.

Such a result would establish that empirical risk minimization over $C_{n,k}$ amounts to a relaxation of the corresponding optimization over deep neural networks, and so our model would consequently inherit the real-world utility of such networks.
ACKNOWLEDGMENTS

This work was supported by an AFOSR Young Investigator Research Program award, NSF Grant No. DMS-1321779, and AFOSR Grant No. F4FGA05076J002. The views expressed in this article are those of the authors and do not reflect the official policy or position of the United States Air Force, Department of Defense, or the U.S. Government.
REFERENCES

[1] N. Alon, P. Frankl, V. Rödl, Geometrical realization of set systems and probabilistic communication complexity, FOCS (1985) 277-280.
[2] D. Amelunxen, M. Lotz, M. B. McCoy, J. A. Tropp, Living on the edge: Phase transitions in convex programs with random data, Inform. Inference 3 (2014) 224-294.
[3] J. Anden, S. Mallat, Deep scattering spectrum, IEEE Trans. Signal Proc. 62 (2014) 4114-4128.
[4] J. Bruna, A. Szlam, Y. LeCun, Signal recovery from pooling representations, Available online: arXiv:1311.4025
[5] A. Choromanska, M. Henaff, M. Mathieu, G. Ben Arous, Y. LeCun, The loss surfaces of multilayer networks, Available online: arXiv:1412.0233
[6] A. Efrati, Google beat Facebook for DeepMind, creates ethics board, The Information, Jan. 27, 2014, Available online: https://www.theinformation.com/Google-beat-Facebook-For-DeepMind-Creates-Ethics-Board
[7] Y. Gordon, On Milman's inequality and random subspaces which escape through a mesh in $\mathbb{R}^n$, Geometric aspects of functional analysis (1986/87) 84-106.
[8] A. Hannun, C. Case, J. Casper, B. Catanzaro, G. Diamos, E. Elsen, R. Prenger, S. Satheesh, S. Sengupta, A. Coates, A. Y. Ng, Deep Speech: Scaling up end-to-end speech recognition, Available online: arXiv:1412.5567
[9] J. Hastad, Computational limitations of small-depth circuits, MIT Press, Cambridge, Mass., 1987.
[10] K. He, X. Zhang, S. Ren, J. Sun, Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification, Available online: arXiv:1502.01852
[11] D. Hernandez, Facebook's quest to build an artificial brain depends on this guy, Wired, Aug. 14, 2014, Available online: http://www.wired.com/2014/08/deep-learning-yann-lecun/
[12] J. Jackson, A. Klivans, R. Servedio, Learnability beyond AC^0, STOC (2002) 776-784.
[13] L. Jacques, J. N. Laska, P. T. Boufounos, R. G. Baraniuk, Robust 1-bit compressive sensing via binary stable embeddings of sparse vectors, IEEE Trans. Inform. Theory 59 (2013) 2082-2102.
[14] Y. LeCun, C. Cortes, C. J. C. Burges, The MNIST database of handwritten digits, Available online: http://yann.lecun.com/exdb/mnist/
[15] X. Li, J. K. Bradley, S. Pawar, K. Ramchandran, Robustifying the sparse Walsh-Hadamard transform without increasing the sample complexity of O(K log N), Available online: https://www.eecs.berkeley.edu/~kannanr/assets/project_ffft/WHT_noisy.pdf
[16] N. Linial, Y. Mansour, N. Nisan, Constant depth circuits, Fourier transform, and learnability, J. ACM 40 (1993) 607-620.
[17] G. F. Montufar, R. Pascanu, K. Cho, Y. Bengio, On the number of linear regions of deep neural networks, NIPS (2014) 1-9.
[18] A. Shpilka, A. Tal, B. Lee Volk, On the structure of Boolean functions with small spectral norm, Available online: arXiv:1304.0371
[19] Y. Taigman, M. Yang, M. Ranzato, L. Wolf, DeepFace: Closing the gap to human-level performance in face verification, CVPR (2014) 1701-1708.
[20] E. van den Berg, M. P. Friedlander, Theoretical and empirical results for recovery from multiple measurements, IEEE Trans. Inform. Theory 56 (2010) 2516-2527.
[21] V. N. Vapnik, The nature of statistical learning theory, 2nd ed., Springer, 2000.
[22] J. Zhu, S. Rosset, T. Hastie, R. Tibshirani, 1-norm support vector machines, NIPS 16 (2004) 49-56.