Pattern Recognition Letters

Computational Cost Reduction in Learned Transform Classifications

arXiv:1504.06779v1 [cs.CV] 26 Apr 2015

Emerson Lopes Machado a,∗∗, Cristiano Jacques Miosso c, Ricardo von Borries d, Murilo Coutinho b, Pedro de Azevedo Berger a, Thiago Marques e, Ricardo Pezzuol Jacobi a

a Dept. of Computer Science, University of Brasilia, Brazil
b Dept. of Statistics, University of Brasilia, Brazil
c University of Brasilia at Gama, Brazil
d Dept. of Electrical and Computer Engineering, University of Texas at El Paso, USA
e Dept. of Mechatronics, University of Brasilia, Brazil

∗∗ Corresponding author. E-mail: [email protected] (Emerson Lopes Machado)

ABSTRACT

We present a theoretical analysis and empirical evaluation of a novel set of techniques for reducing the computational cost of classifiers based on a learned transform and soft-threshold. By modifying the optimization procedures for dictionary and classifier training, as well as the resulting dictionary entries, our techniques allow us to reduce the bit precision and to replace each floating-point multiplication by a single integer bit shift. We also show how the optimization algorithms in some dictionary training methods can be modified to penalize higher-energy dictionaries. We applied our techniques to the classifier of the Learning Algorithm for Soft-Thresholding (LAST), testing on the datasets used in its original paper. Our results indicate that it is feasible to classify at test time using only integer additions and bit shifts, with a limited reduction of the classification accuracy. These low-power operations are a valuable trade-off in FPGA implementations, as they increase the classification throughput while decreasing both energy consumption and manufacturing cost. Moreover, our techniques reduced the bit precision by 50% on almost all datasets we tested, enabling the use of 32-bit instead of 64-bit operations on GPUs, which can almost double the classification throughput.

1. Introduction

In image classification, feature extraction is an important step, especially in domains where the training set has a high-dimensional space that demands considerable processing and memory resources. A recent trend in feature extraction for image classification is the construction of sparse features, which consist of the representation of the signal in an overcomplete dictionary. When the dictionary is learned specifically for the input dataset, the classification of sparse features can achieve results comparable to state-of-the-art classification algorithms (Mairal et al., 2012). However, this approach has a drawback at test time: the sparse coding of the input test sample is computationally intense, which makes it impractical for embedded applications with scarce computational and power resources.

A recent approach to this drawback is to learn a sparsifying transform from the target image dataset (Fawzi et al., 2014; Shekhar et al., 2014; Ravishankar and Bresler, 2013). At test time, this approach reduces the sparse coding of the input image to a simple matrix-vector multiplication followed by a soft-threshold, which can be efficiently realized in hardware due to its inherently parallel nature. Nevertheless, these matrix-vector multiplications require floating-point operations, which may have a high cost in hardware, especially in FPGAs, where floating-point arithmetic requires a much larger area and higher energy consumption.

Exploring properties we derive from these classifiers, we propose a set of techniques to reduce their computational cost at test time, which we divide into four main groups: (i) use test images in their raw (integer) representation instead of their normalized (floating-point) version, and thus replace the costly floating-point operations by integer operations, which are cheaper to implement in hardware and do not affect the classification accuracy; (ii) discretize both the transform dictionary and the classifier by approximating their elements to the nearest power of 2, and thus replace all multiplications by simple bit shifts, at the cost of a slight decrease in the classification accuracy; (iii) decrease the dynamic range of the test images by reducing the quantization level of the integer-valued test images; and (iv) decrease the dynamic range of the dictionary, first by penalizing the ℓ2 norm of its entries during the training phase and second by zeroing out the entries whose absolute values are smaller than a trained threshold. The last two techniques reduce the bit precision of the matrix-vector multiplication at the cost of a slight decrease in the classification accuracy.

As a case study for our techniques, we use a recent classification algorithm named Learning Algorithm for Soft-Thresholding classifier (LAST), which learns both the sparse representation of the signals and the hyperplane vector classifier at the same time. Our tests use the same datasets used in the paper that introduces LAST, and our results indicate that our techniques reduce the computational cost without substantially degrading the classification accuracy. Moreover, on one particular dataset we tested, our techniques substantially increased the classification accuracy.

In this work, all simulations we ran to test our techniques were performed on image classification using LAST. Nevertheless, our proposed techniques are sufficiently general to be applied to different problems and different classification algorithms that use matrix-vector multiplications to extract features, such as the Extreme Learning Machine (ELM) (Huang et al., 2006) and Deep Neural Networks (DNN) (Schmidhuber, 2015).

To the best of our knowledge, this paper presents the first generic approach to reduce the computational cost at test time of classifiers that are based on a learned transform. This has valuable applications in embedded systems where power consumption is critical and computational power is restricted. Furthermore, these techniques dismiss the need for DSP blocks for intense matrix-vector operations in FPGA architectures in the context of image classification, lowering the overall manufacturing cost of embedded systems.

2. Overview of Sparse Representation Classification

In this section, we briefly review both the synthesis and the analysis (transform) sparse representations of signals, along with the threshold operation used as a sparse coding approach (Section 2.1). We also review LAST (Section 2.2).

2.1. Sparse Representation of Signals

Let x ∈ R^d be a signal vector and D ∈ R^{d×p} be an overcomplete dictionary. The sparse representation problem is to find the coefficient vector α ∈ R^p such that ||α||_0 is minimum, i.e.,

    α̂ = arg min_{α'} ||α'||_0   s.t.   x = Dα',        (1)

where ||·||_0 measures the number of nonzero coefficients. Therefore, the signal x can be synthesized as a linear combination of the k nonzero vectors from the dictionary D, also called the synthesis operator. The solution of (1) requires testing all possible sparse vectors α', which is a combination of d elements taken k at a time. This problem is NP-hard, but an approximate solution can be obtained by using the ℓ1 norm in place of the ℓ0 norm,

    α̂ = arg min_{α'} ||α'||_1   s.t.   x = Dα',        (2)

where ||·||_1 is the ℓ1 norm. The solution of (2) can be computed by minimizing the ℓ1 norm of the coefficients among all decompositions, which is a convex problem and can be solved efficiently. If the solution of (2) is sufficiently sparse, it will be equal to the solution of (1) (Donoho and Huo, 2001).

The sparse coding transform (Ravishankar and Bresler, 2013) is another way of sparsifying a signal, where the dictionary is a linear transform that maps the signal to a sparse representation. For example, signals formed by the superposition of sinusoids have a dense representation in the time domain and a sparse representation in the frequency domain; for this type of signal, the Fourier transform is the sparse coding transform. Quite simply, D^T x = z is the sparse transform of x, where z is the sparse coefficient vector. In general, the transform D can be a well-structured fixed basis, such as the DFT, or it can be learned specifically for the target problem represented in the training dataset. A learned dictionary can be an overcomplete dictionary learned from the signal dataset, as in (Shekhar et al., 2014), a square invertible dictionary, as in (Ravishankar and Bresler, 2013), or even a dictionary without restrictions on the number of atoms, as in LAST (Fawzi et al., 2014).

When a signal is corrupted by additive white Gaussian noise (AWGN), its transform results in a coefficient vector that is not sparse. A common way of making it sparse is to apply a threshold operation to its entries right after the transform, where the entries lower than the threshold are set to zero. The soft-threshold is a threshold operator that, in addition to the threshold operation, subtracts the threshold from the remaining values, shrinking them toward zero (Donoho and Johnstone, 1994). Let z = (z_i)_{i=1}^n be the coefficients of a sparse representation of a signal corrupted by AWGN, given by

    z_i = s_i + ε e_i,   i = 1, ..., n,        (3)

where the e_i are independent and identically distributed as N(0, 1), ε > 0 is the noise level, and the s_i are the coefficients of the sparse representation of the clean signal. Because the coefficients s_i in (3) are sparse, there exists a threshold λ that can separate most of the clean signal s_i from the noise e_i using the soft-thresholding operator

    h_λ(z) = sgn(z) max(0, |z| − λ),        (4)

where sgn(·) is the sign function. For classification tasks, the best estimate of λ can be computed using the training set.
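
To make the thresholding step concrete, here is a minimal NumPy sketch (ours, not from the original paper) of the soft-threshold operator in (4); the function name soft_threshold is our own.

```python
import numpy as np

def soft_threshold(z, lam):
    """Soft-threshold operator of (4): h_lam(z) = sgn(z) * max(0, |z| - lam).

    Entries with magnitude below lam are zeroed; the others are shrunk
    toward zero by lam.
    """
    return np.sign(z) * np.maximum(0.0, np.abs(z) - lam)

# Example: a dense coefficient vector becomes sparse after thresholding.
z = np.array([0.05, -0.8, 0.3, 1.2])
print(soft_threshold(z, lam=0.1))   # -> [ 0.  -0.7  0.2  1.1]
```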

2.2. Learning Algorithm for Soft-Thresholding Classifier (LAST)

LAST (Fawzi et al., 2014) is an algorithm based on a learned transform followed by a soft-threshold, as described in Section 2.1. Differently from the original soft-threshold map presented in (4), LAST uses a soft-threshold version that also sets to zero all negative values, h_α(z) = max(0, z − α), where α is the threshold, also called the sparsity parameter. We chose LAST as our case study because of the simplicity of its learning process, as it jointly learns the sparsifying dictionary and the classifier hyperplane. For the training cases X = [x_1 | ... | x_m] ∈ R^{n×m} with labels y = [y_1 | ... | y_m] ∈ {−1, 1}^m, the sparsifying dictionary D ∈ R^{n×N}, which contains N atoms, and the classifier hyperplane w ∈ R^N are estimated using the supervised optimization

    arg min_{D,w}  Σ_{i=1}^{m} L(y_i w^T h_α(D^T x_i)) + (ν/2) ||w||_2^2,        (5)

where L is the hinge loss function L(x) = max(0, 1 − x) and ν is the regularization parameter that prevents overfitting of the classifier w to the training set. At test time, the classification of each test sample x is performed by first extracting the sparse features from the signal x, using f = max(0, D^T x − α), followed by the classification of these features using c = (w^T f > 0), where c is the class returned by the classifier. We direct the reader to (Fawzi et al., 2014) for a deeper understanding of LAST.
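
As an illustration of this test-time pipeline (one matrix-vector product, one elementwise threshold, one inner product), the sketch below is our own, with hypothetical names and random arrays standing in for a trained D, w, and a test sample.

```python
import numpy as np

def last_classify(D, w, alpha, x):
    """Test-time classification with a learned transform + soft-threshold.

    D:     (n, N) dictionary, w: (N,) classifier hyperplane,
    alpha: scalar sparsity parameter, x: (n,) test sample.
    Returns +1 or -1.
    """
    f = np.maximum(0.0, D.T @ x - alpha)   # sparse features, h_alpha(D^T x)
    return 1 if w @ f > 0 else -1

# Toy usage with random (untrained) parameters, just to show the shapes.
rng = np.random.default_rng(0)
n, N = 144, 50                      # e.g., 12x12 patches, 50 atoms
D = rng.standard_normal((n, N))
w = rng.standard_normal(N)
x = rng.standard_normal(n)
print(last_classify(D, w, alpha=1.0, x=x))
```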

3. Proposed Techniques

In this section, we present our techniques to simplify the test-time computations of classifiers that are based on a learned transform and soft-threshold. We first present, in Section 3.1, the empirical findings that underlie our techniques, and afterward the techniques themselves in Section 3.2.

3.1. Theoretical Results on Computational Cost Reduction

For brevity, we coined the term powerize to concisely describe the operation of approximating a value to its closest power of 2.

Theorem 1. The relative distance between any real scalar x and its powerized version is upper bounded by 1/3.

Proof. Let 2^n ≤ x ≤ 2^{n+1}, n ∈ Z, and let d_{p2}(x) be the distance between x and its powerized version. The distance d_{p2}(x) is maximum when x is the midpoint between the two closest powers of 2, which is m = (1/2)(2^{n+1} + 2^n) = (2^n/2)(2 + 1) = 2^{n−1} · 3. Therefore, the distance d_{p2}(x) when x = m is d_{p2}(m) = m − 2^n = 2^{n−1} · 3 − 2^n = 2^{n−1}(3 − 2) = 2^{n−1} = m/3, and so the maximum relative distance between x and its powerized version is d_{p2}(m)/m, which is equal to 1/3.
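
For concreteness, one possible implementation (ours) of the powerize operation is to pick, for each nonzero entry, the closer of the two neighboring powers of 2; ties at the exact midpoint may be rounded either way without affecting the 1/3 bound.

```python
import numpy as np

def powerize(a):
    """Replace each entry of `a` by the closest power of 2 (sign preserved).

    Zeros are left untouched. By Theorem 1, the relative error introduced
    is at most 1/3.
    """
    a = np.asarray(a, dtype=float)
    out = np.zeros_like(a)
    nz = a != 0
    mag = np.abs(a[nz])
    lo = 2.0 ** np.floor(np.log2(mag))       # power of 2 just below |a|
    hi = 2.0 * lo                            # power of 2 just above |a|
    nearest = np.where(hi - mag < mag - lo, hi, lo)
    out[nz] = np.sign(a[nz]) * nearest
    return out

print(powerize([0.8, -3.1, 0.3, 6.5]))       # -> [ 1.  -4.  0.25  8. ]
```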

We now show how the classification accuracy on the test set behaves when small variations are introduced in the entries of D and w. Using the datasets described in Section 4.1, we trained 10 different pairs of D and w, with 50 atoms, and created 50 versions of each pair. Each version D_i and w_i, i = 1, 2, ..., 50, was built by multiplying the elements of D and w by a random value drawn from the uniform distribution on the open interval (1 − d_i, 1 + d_i), where d_i ∈ {0.02, 0.04, 0.06, ..., 1}. Next, we evaluated all of them on the test set. The results, shown in Figure 1, indicate a clear trade-off between the classification accuracy and how far the entries of D_i and w_i are displaced from the corresponding entries of D and w, which is controlled by d.

Figure 1: Classification accuracy when the elements of D and w are displaced from their original value by a random amount, upper bounded by d. These results were built from the classification of the test set using 10 different pairs of D, with 50 atoms, and w. The datasets used in this simulation are described in Section 4.1.

Hypothesis 1. Both D and w can be powerized at the cost of a small decrease in the classification accuracy.

It is worth noting that Theorem 1 guarantees an upper bound of 1/3 for the relative distance between any real scalar x and its powerized version. Therefore, it is reasonable to hypothesize that the classification accuracy using the powerized pair D_power and w_power is no worse than that obtained using D_i and w_i with d_i = 1/3, shown in Figure 1. To support this hypothesis, we ran another simulation using the datasets described in Section 4.1. For this simulation, we trained 10 pairs of D and w with different training sets and evaluated them and their respective powerized versions on the test set. On the bark versus woodgrain dataset, the original model accuracy was 97.33% (0.93) and the powerized model accuracy was 97.00% (1.06). On the pigskin versus pressedcl dataset, the original model accuracy was 84.00% (1.61) and the powerized model accuracy was 82.65% (1.26).

Theorem 2. Let D and w be, respectively, the sparsifying dictionary and the linear classifier trained with the normalized training set (ℓ2 norm equal to 1). The classifications of the raw signals (integer values) and of the normalized signals (ℓ2 norm equal to 1) are exactly the same when the sparsity parameter α is properly adjusted to the raw signals.

Proof. Let x_int and x be, respectively, a raw vector from the test set and its normalized version, with ||x||_2 = 1, and let D and w be trained with α = 1. The extracted features are f = D^T x = D^T x_int / ||x_int||_2, and the soft-thresholded features are f_α = max(0, f − α) = max(0, D^T x_int / ||x_int||_2 − 1) = (1 / ||x_int||_2) max(0, D^T x_int − ||x_int||_2). Finally, the classification of x_int is c = (w^T (1 / ||x_int||_2) max(0, D^T x_int − ||x_int||_2) > 0). As the ℓ2 norm of any nonzero real vector is greater than 0, we have 1 / ||x_int||_2 > 0, and thus c = (w^T max(0, D^T x_int − ||x_int||_2) > 0). Therefore, as x = x_int / ||x_int||_2, the expressions c = (w^T max(0, D^T x − α) > 0), with α = 1, and c = (w^T max(0, D^T x_int − α) > 0), with α = ||x_int||_2, are equivalent.
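
Theorem 2 can be checked numerically. The sketch below (ours) uses random arrays in place of a trained D and w and an 8-bit integer test sample; it verifies that classifying the normalized signal with α = 1 and classifying the raw signal with α = ||x_int||_2 yield the same decision.

```python
import numpy as np

rng = np.random.default_rng(1)
n, N = 144, 50
D = rng.standard_normal((n, N))            # stands in for a trained dictionary
w = rng.standard_normal(N)                 # stands in for a trained classifier
x_int = rng.integers(0, 256, size=n)       # raw 8-bit test sample

norm = np.linalg.norm(x_int)
x = x_int / norm                           # normalized version (||x||_2 = 1)

c_normalized = w @ np.maximum(0.0, D.T @ x - 1.0) > 0        # alpha = 1
c_raw        = w @ np.maximum(0.0, D.T @ x_int - norm) > 0   # alpha = ||x_int||_2

assert c_normalized == c_raw               # same decision, per Theorem 2
print(c_normalized, c_raw)
```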

Empirical evidence 1. Increasing the sparsity of the dictionary D up to a certain level decreases the minimum number of bits necessary to store it and, consequently, also reduces the number of bits needed to compute the sparse representation D^T X, at the cost of a slight decrease in classification accuracy.

We hypothesized that forcing D to be sparse would decrease its dynamic range with no substantial decrease in classification accuracy. To test this hypothesis, we performed another simulation with the datasets described in Section 4.1. For each of the 14 threshold values linearly spaced between 0 and 4, we averaged the results of 10 pairs of D and w trained on different training sets and evaluated on the test set. As shown in Figure 2(c), the first nonzero threshold already cuts the number of bits needed to represent D in half while, unexpectedly, increasing the classification accuracy. Also, the third nonzero threshold applied to D, shown in Figure 2(d), maintains the classification accuracy while reducing the dynamic range of D to less than half of the original.

Figure 2: Classification accuracy and number of bits in D for hard-threshold values applied to the dictionary D, on bark versus woodgrain and pigskin versus pressedcl. These are the averages of the classification results on the test set evaluated with 10 pairs of D and w, with 50 atoms, trained with different training sets. The original results are shown at zero_threshold = 0. The datasets are described in Section 4.1.

Empirical evidence 2. Decreasing the quantization level of the integer-valued test images X_int up to a certain level decreases the dynamic range of X_int at the cost of a slight decrease in classification accuracy.

We also hypothesized that the original continuous signal may be unnecessarily over-quantized, and that its quantization level may be decreased without substantially affecting the classification accuracy. To test this hypothesis, we performed another simulation with the binary datasets described in Section 4.1. In this simulation, we averaged the results of one thousand runs consisting of 10 pairs of D and w trained on different training sets and evaluated on the test set; each training set X was quantized from 1 to 15 quantization levels. The results are shown in Figure 3. It is worth noting that both datasets can be reduced to 2 bits (quantization levels 2 and 3) with a limited decrease in classification accuracy.

Figure 3: Classification accuracy on quantized versions of the test set, for (a) bark versus woodgrain and (b) pigskin versus pressedcl. These results are the average of the classification results on the test set evaluated with 10 pairs of D and w, with 50 atoms, trained with different training sets. The original results are marked at position −1 of the quantization level axis. The datasets are described in Section 4.1.
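
To make the bit-count discussion concrete, the sketch below shows one rough way (our own accounting, not necessarily the one used to produce the figures) to bound the accumulator width needed to evaluate D^T x when the test image is quantized to q levels and the dictionary entries are powers of 2.

```python
import numpy as np

def accumulator_bits(D_pow, q, n):
    """Worst-case signed bit width for the accumulator of D^T x, assuming
    x has non-negative integer entries in [0, q] and D_pow has entries that
    are 0 or +/- powers of 2 (e.g., after powerizing and hard-thresholding).

    All entries of D_pow are rescaled by 2^{-e_min} so that every
    multiplication becomes an integer left shift.
    """
    exps = np.log2(np.abs(D_pow[D_pow != 0])).astype(int)   # exponents of |d_ij|
    span = exps.max() - exps.min()                           # dynamic range of D in bits
    worst = n * q * (2 ** span)                              # bound on |sum_i x_i d_ij 2^{-e_min}|
    return int(np.ceil(np.log2(worst))) + 1                  # +1 sign bit

# Example: 144-dimensional patches, 2-bit images (q = 3), dictionary entries
# spanning 2^-6 .. 2^0 after powerizing.
D_pow = np.array([1.0, -0.5, 0.25, -0.015625, 0.0])
print(accumulator_bits(D_pow, q=3, n=144))                   # -> 16
```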

3.2. Proposed Techniques

Technique 1. Use signals in their raw (integer) representation instead of their normalized (floating-point) version.

Technique 2. Powerize D and w.
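
Once D and w are powerized, every multiplication by an entry ±2^e of the dictionary or the classifier reduces to a bit shift of the integer input plus a sign. A minimal shift-and-add dot product (our sketch, assuming the weight vector has been rescaled by a common power of 2 so that all exponents are non-negative):

```python
def shift_dot(x_int, exps, signs):
    """Dot product of an integer vector with powerized weights.

    exps[j], signs[j] encode the weight signs[j] * 2^exps[j].
    Only integer shifts and additions are used.
    """
    acc = 0
    for xi, e, s in zip(x_int, exps, signs):
        acc += s * (xi << e)     # multiplication by 2^e as a left shift
    return acc

# weights 4, -2, 1 applied to integer inputs 3, 5, 7: 12 - 10 + 7 = 9
print(shift_dot([3, 5, 7], exps=[2, 1, 0], signs=[1, -1, 1]))   # -> 9
```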

Technique 3. Decrease the dynamic range of the test set X_int by quantizing the integer-valued test images X_int.
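
One plausible realization (ours; the paper does not specify the exact mapping) of this requantization is a uniform rescaling of the 8-bit pixel range to the levels 0..quanta:

```python
import numpy as np

def requantize(x_int, quanta, in_max=255):
    """Map integer pixel values in [0, in_max] to the levels 0..quanta.

    The result still holds integers, but its dynamic range (and hence the
    number of bits needed to compute D^T x) is reduced.
    """
    x_int = np.asarray(x_int)
    return np.rint(x_int * (quanta / in_max)).astype(int)

x = np.array([0, 37, 128, 200, 255])
print(requantize(x, quanta=3))     # -> [0 0 2 2 3]
```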

Technique 4. Decrease the dynamic range of the entries of D by penalizing their ℓ2 norm during training, followed by hard-thresholding them using a trained threshold.

Our strategy to decrease the dynamic range of the dictionary D involves the addition of a penalty on the ℓ2 norm of its entries during the minimization of the objective function of LAST, described in (5). The new objective function becomes

    arg min_{D,w}  Σ_{i=1}^{m} L(y_i w^T h_α(D^T x_i)) + (ν/2) ||w||_2^2 + (κ/2) ||D||_2^2,        (6)

where κ controls this new penalization. In Section 3.3, we show how to include this penalization in general constrained optimization algorithms. After training D and w using the modified objective function (6), we apply a hard threshold to the entries of D in order to zero out the values closest to zero. Our assumption is that these small values of D contribute little to the final feature values and thus can be set to zero without much effect on the classification accuracy. As for the threshold value, we test the best one among all unique absolute values of D after it has been powerized using Technique 2. As the number of unique absolute values of D is substantially reduced after applying Technique 2, the computational burden of testing all possible values is low.

3.3. Inclusion of an ℓ2 Norm Penalization Term in Dictionary Training Algorithms Based on Constrained Optimization

We show how to include in the objective function a term that penalizes candidate dictionaries whose entries have larger energy values, as opposed to lower-energy dictionaries. By favoring vectors with lower energies, we may obtain dictionaries that span narrower ranges of values. We then show how to include this penalization in gradient descent (GD) methods, which are among the most widely used optimization methods (Boyd and Vandenberghe, 2004).
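
This search over threshold candidates can be sketched as follows (our code; accuracy_on_validation is a hypothetical scoring function on the held-out 20% described in Section 3.4):

```python
import numpy as np

def hard_threshold(D, t):
    """Zero out dictionary entries whose magnitude is below t."""
    D = D.copy()
    D[np.abs(D) < t] = 0.0
    return D

def candidate_thresholds(D_pow):
    """Candidate thresholds: the unique absolute values of the powerized
    dictionary (a much smaller set than the unique values of D itself),
    plus 0 for 'no thresholding'."""
    return np.concatenate(([0.0], np.unique(np.abs(D_pow[D_pow != 0]))))

# Usage sketch: pick the candidate that maximizes validation accuracy.
# best_t = max(candidate_thresholds(D_pow),
#              key=lambda t: accuracy_on_validation(hard_threshold(D_pow, t), w))
```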

Several dictionary and classifier training methods, such as (Fawzi et al., 2014; Ravishankar and Bresler, 2013), are based on constrained optimization programs of the form

    minimize_{V,w}  f(V, w)
    subject to      g(V, w) = 0,        (7)

where: (i) V is an n_1 × 1 vector containing the dictionary terms and w is an n_2 × 1 vector of classifier parameters; (ii) f : R^n → R, n = n_1 + n_2, is the cost function based on the training set; (iii) 0 is the null vector; and (iv) g : R^n → R^m is a function representing m scalar equality constraints. Some methods also include inequality constraints. In order to penalize the total energy associated with the dictionary entries, we can replace any problem of the form (7) by

    minimize_{V,w}  f(V, w) + (κ/2) ||V||_2^2
    subject to      g(V, w) = 0,        (8)

where κ > 0 is a penalization weight. Iterative methods are commonly used to solve constrained optimization problems such as (8) (Boyd and Vandenberghe, 2004). They start with an initial value x^(0) = [V^(0) w^(0)]^T for x = [V w]^T, which is iterated to generate a sequence x^(n), expected to converge, satisfying

    x^(n+1) = x^(n) + ξ Δx^(n),   ∀ n ≥ 0,        (9)

where ξ is the step size and Δx^(n) = [ΔV^(n) Δw^(n)] is the step computed by the particular iterative method. We consider GD, where computing Δx^(n) requires evaluating the gradient of a dual function associated with the objective function and the constraints (Boyd and Vandenberghe, 2004). Specifically, the Lagrangian L(V, w, λ) is an example of a dual function, thus having a local maximum that is a minimum of the objective function at a point that satisfies the constraints. For problems (7) and (8), the Lagrangian functions are given respectively by

    L(V, w, λ) = f(V, w) + λ^T g(V, w)        (10)

and

    L̂(V, w, λ) = f(V, w) + λ^T g(V, w) + (κ/2) ||V||_2^2,        (11)

with λ the vector of m Lagrange multipliers. Our first objective in solving the modified problem (8) is to compute the gradient of L̂(V, w, λ) in terms of the gradient of L(V, w, λ), so as to show how a procedure that solves (7) can be modified to solve (8).

3.3.1. Including the Penalization Term in GD Methods

In GD optimization methods, the step Δx^(n) depends directly on the gradient of the dual function L̂(x), evaluated at x^(n) = [V^(n) w^(n)]^T (Boyd and Vandenberghe, 2004). We now establish the relation between L̂ and L, in order to determine how such methods must be modified to include the penalization we propose. By comparing (10) and (11), and by defining ∇_v g as the gradient of any function g with respect to a vector v, note that ∇_V L̂(V, w, λ) = ∇_V L(V, w, λ) + κ ∇_V (1/2)||V||_2^2, ∇_w L̂(V, w, λ) = ∇_w L(V, w, λ), and ∇_λ L̂(V, w, λ) = ∇_λ L(V, w, λ). As ∇_V (1/2)||V||_2^2 = V, it is easy to see that ∇_V L̂(V, w, λ) = ∇_V L(V, w, λ) + κV. In summary, the gradient of the modified Lagrangian L̂ can be computed from the original Lagrangian used in a given optimization problem by using the expressions

    ∇_V L̂(V, w, λ) = ∇_V L(V, w, λ) + κV,        (12)
    ∇_w L̂(V, w, λ) = ∇_w L(V, w, λ),        (13)
    ∇_λ L̂(V, w, λ) = ∇_λ L(V, w, λ).        (14)

Equations (12), (13), and (14) show how we modify the estimated gradient in any GD method (such as LAST (Fawzi et al., 2014)) in order to penalize the range of the dictionary entries, and thus favor a solution with a narrower range. Note that only the gradient with respect to the dictionary is altered.
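
In code, the modification in (12)-(14) amounts to adding κV to whatever dictionary gradient the unmodified method already computes. A schematic sketch (ours), where grad_V_original and grad_w_original stand for the gradients used by the original method:

```python
def penalized_gradients(grad_V_original, grad_w_original, V, kappa):
    """Gradients for the modified Lagrangian of (8), per (12)-(14):
    only the dictionary gradient gains the extra kappa * V term."""
    grad_V = grad_V_original + kappa * V      # (12)
    grad_w = grad_w_original                  # (13); (14) is likewise unchanged
    return grad_V, grad_w

def gd_step(V, w, grad_V_original, grad_w_original, kappa, xi):
    """One gradient-descent update x^(n+1) = x^(n) + xi * dx^(n) of (9),
    with the descent directions given by the negative gradients."""
    grad_V, grad_w = penalized_gradients(grad_V_original, grad_w_original, V, kappa)
    return V - xi * grad_V, w - xi * grad_w
```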

3.4. Model Selection

Our simulations, described in Section 4, generated many different models due to the large number of parameter combinations of Technique 3 and Technique 4. To select the best combination of the parameters κ, z_threshold, and quanta, we relied on the classification accuracy on a separate data set. The ranges of these parameters are defined in Section 4.1, as is the parameter that controls the trade-off between the classification accuracy and the bit resolution of the final classifier, denoted by γ. We used the following steps for model selection (see the sketch at the end of this subsection):

(i) First, we used 80% of the training set to train the models (D and w) and used the remaining 20% to estimate the best combination of the parameters κ, z_threshold, and quanta.
(ii) Let M be the set of models trained with all combinations of the parameters κ, z_threshold, and quanta. Also, let R = M(X) be the set of classification results of the training set X using the models M, and let best_acc be the best training accuracy in R.
(iii) From M, we create the subset M_γ containing the models with results R_γ = R[accuracy >= (1 − γ) best_acc].
(iv) From M_γ, we create a new subset M_bits with results R_bits = R_γ[number of bits == lowest_num_bits], where lowest_num_bits is the lowest number of bits necessary for the computation of D^T X.
(v) From R_bits, we finally choose the model M_best such that the result R_best = R_bits[sparsest representation of X].

The traditional rule of thumb of using 2/3 of the dataset for training and 1/3 for testing is a safe way of estimating the true classification accuracy when the classification accuracy on the whole dataset is higher than 85% (Dobbin and Simon, 2011). As we are reserving part of the training set only for the selection of the best parameter values, and not for the estimation of the true classification accuracy, we opted for the more conservative proportion of 80% for training our models. This has the advantage of lowering the chance of missing an underrepresented training set sample. Moreover, the last step in our model selection algorithm selects the model that produces the sparsest signal representation, as this leads to models that generalize better (Bengio et al., 2013).
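
The selection steps (ii)-(v) above can be written as successive filters over the set of trained models. In the sketch below (ours), each model record is assumed to carry its validation accuracy, the bit width of D^T X, and the sparsity of the resulting representation:

```python
def select_model(models, gamma):
    """Steps (ii)-(v): filter by accuracy, then bit width, then sparsity.

    `models` is a list of dicts with keys 'accuracy', 'bits', and
    'sparsity' (fraction of zero coefficients), one per parameter
    combination of kappa, z_threshold, and quanta.
    """
    best_acc = max(m['accuracy'] for m in models)
    M_gamma = [m for m in models if m['accuracy'] >= (1 - gamma) * best_acc]
    lowest_bits = min(m['bits'] for m in M_gamma)
    M_bits = [m for m in M_gamma if m['bits'] == lowest_bits]
    return max(M_bits, key=lambda m: m['sparsity'])   # sparsest representation

# Example with three hypothetical models:
models = [{'accuracy': 0.97, 'bits': 31, 'sparsity': 0.80},
          {'accuracy': 0.96, 'bits': 29, 'sparsity': 0.85},
          {'accuracy': 0.96, 'bits': 29, 'sparsity': 0.90}]
print(select_model(models, gamma=0.02))   # -> the third model
```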

4. Simulations

In this section, we evaluate how our techniques affect the accuracy of LAST on the same datasets used in (Fawzi et al., 2014), which are described in Section 4.1 along with the parameter values we chose to evaluate our techniques. The analysis of the results we obtained follows in Section 4.2.

4.1. Datasets and Choice of the Parameters

We used five of the six datasets used in the paper that describes LAST (Fawzi et al., 2014), because we could not find the USPS dataset in integer form. The simulations consist of training D and w with the original version of LAST and with the modified version of LAST built with our techniques presented in Section 3.2.

The first two datasets contain patches of textures extracted from the Brodatz dataset (Valkealahti and Oja, 1998). As in (Fawzi et al., 2014), the first task consisted in discriminating between the images bark and woodgrain, and the second task consisted in discriminating between pigskin and pressedcl. First, we separated each image into two disjoint pieces and took the training patches from one piece and the test patches from the other. As in (Fawzi et al., 2014), the training and test sets were built with 500 patches of the textures with a size of 12 × 12 pixels. These patches were transformed into vectors and then normalized to have ℓ2 norm equal to 1.

The third binary dataset was built using a subset of the CIFAR-10 image dataset (Krizhevsky, 2009). This dataset contains 10 classes of 60 000 32 × 32 tiny RGB images, with 50 000 images in the training set and 10 000 in the test set. Each image has 3 color channels and is stored in a vector of 32 × 32 × 3 = 3072 positions. The chosen images are those labeled as deer and horse.

The first multiclass dataset was the MNIST dataset (LeCun et al., 1998), which contains 70 000 images of handwritten digits of size 28 × 28, distributed in 60 000 training images and 10 000 test images. As in (Fawzi et al., 2014), all images have zero mean and ℓ2 norm equal to 1. The last task consisted in the classification of all 10 classes of the CIFAR-10 image dataset.

For all datasets, we fixed the parameter κ ∈ {4, 8, 10, ..., 20} × 10^{−3} and let z_threshold assume all unique values of the powerized dictionary D_power, i.e., of D after applying Technique 2. As the number of unique values of D_power is substantially lower than that of D, the computational burden of testing all valid thresholds is low. We also fixed the quantization parameter quanta ∈ {1, 2, ..., 10} ∪ {31, 127}, and we fixed the trade-off parameter at γ = 0.001. The choice of these parameter values was empirically based on a previous run of all simulations. As for the parameters of LAST, we used the same values used in (Fawzi et al., 2014), and we direct the reader to that paper for further understanding of these parameters and their values.

4.2. Results and Analyses

In this section, the original results are the ones from the classification of the test set using the model built with the original LAST algorithm. Conversely, the proposed results are the ones obtained from the classification of the test set using the best model R_best built for each dataset, selected using the methodology presented in Section 3.4.

We show the results of our simulations on the binary tasks in Figure 4. As shown in Figures 4(d), 4(e), and 4(f), our techniques do not substantially decrease the original classification accuracy. At the same time, our techniques considerably reduce the number of bits necessary to perform the multiplication D^T X, as shown in Figures 4(a), 4(b), and 4(c). This reduction allows the use of 32-bit single-precision floating-point arithmetic on GPUs instead of 64-bit double-precision floating-point arithmetic, which increases the computational throughput (Du et al., 2012). One can note that the original results in Figures 4(d) and 4(e) are lower than the ones presented in (Fawzi et al., 2014). Differently from their work, we used disjoint training and test sets to allow a better estimation of the true classification accuracy.

Figure 4: Comparison of the results using the original LAST algorithm and our proposed techniques. For each dataset, these plots show, as a function of the dictionary size, the trade-off between the necessary number of bits for D^T X (top: (a) bark versus woodgrain, (b) pigskin versus pressedcl, (c) CIFAR-10 deer versus horse) and the classification accuracy (bottom: (d), (e), (f), same datasets). Our approach reduces the necessary number of bits to almost half of the original formulation at the cost of a slight decrease in classification accuracy. The datasets are described in Section 4.1.

Table 1 contains the results of the simulations on the MNIST and CIFAR-10 tasks. The original results we obtained for both large datasets have higher classification error than the ones reported in (Fawzi et al., 2014). We hypothesize that this is caused by the random nature of LAST on larger datasets, where each GD step is optimized for a small portion of the data called a mini-batch, which is randomly sampled from the training set. Moreover, we trained D and w using 4/5 of the training set used in (Fawzi et al., 2014), which may negatively affect the generalization power of the dictionary and classifier.

Note that our techniques resulted in a slight increase of the classification error on the MNIST task. Nevertheless, our techniques reduced the number of bits necessary to run the classification at test time to less than half. Again, this dynamic range reduction is highly valuable for applications on both GPUs and FPGAs. As for the CIFAR-10 task, our techniques produced a model that has substantially lower error than the original model while using almost half of the necessary number of bits at test time.

Table 1: Comparison between the original and the proposed results regarding the classification error and the number of bits necessary to compute the matrix-vector multiplication D^T X of the sparse representation.

              MNIST                      CIFAR-10
           Error %   # bits D^T X     Error %   # bits D^T X
Original    2.48         61            60.00        55
Proposed    2.52         29            53.08        29

The results presented in this section indicate the feasibility of using integer operations in place of floating-point operations, and bit shifts in place of multiplications, with a slight decrease in classification accuracy. These substitutions reduce the computational cost of classification at test time in FPGAs, which is important in embedded applications where power consumption is critical. Moreover, our techniques cut almost in half the number of bits necessary to perform the most expensive operation in the classification, the matrix-vector multiplication D^T X. This was a result of applying both Technique 3 and Technique 4, and it enables the use of 32-bit single-precision floating-point operations in place of 64-bit double-precision floating-point operations on GPUs, which can almost double their computational throughput (Du et al., 2012).

It is also worth noting that our techniques were developed to reduce the computational cost of the classification with an expected accuracy reduction, within acceptable limits. Nevertheless, the classification accuracies on the bark versus woodgrain dataset using our techniques substantially outperform the accuracies of the original model, as shown in Figure 4(d). These higher accuracies were unexpected. Regarding the original models, we noted that the classification accuracies on the training set were 100% when using dictionaries with at least 50 atoms. These models were probably overfitted to the training set, making them fail to generalize to new data. As our powerize technique introduces a perturbation to the elements of both D and w, we hypothesize that it reduced the overfitting of D and w to the training set and, consequently, increased their generalization power on unseen data (Pfahringer, 1995). However, this needs further investigation.

5. Conclusion

This paper presented a set of techniques for reducing the test-time computations of classifiers that are based on a learned transform and soft-threshold. In essence, the techniques are: adjust the threshold so the classifier can use signals represented as integers instead of their normalized floating-point version; reduce the multiplications to simple bit shifts by approximating the entries of both the dictionary D and the classifier vector w to their nearest powers of 2; and increase the sparsity of the dictionary D by applying a hard threshold to its entries. We ran simulations using the same datasets used in the original paper that introduces LAST, and our results indicate that our techniques substantially reduce the computational load at a small cost in classification accuracy. Moreover, on one of the datasets tested there was a substantial increase in the accuracy of the classifier. These proposed optimization techniques are valuable in applications where power consumption is critical.

Acknowledgments

This work was partially supported by a scholarship from the Coordination of Improvement of Higher Education Personnel (Portuguese acronym CAPES). We thank the Dept. of ECE of UTEP for allowing us access to the NSF-supported cluster (NSF CNS-0709438) used in all the simulations described here, and Mr. N. Gumataotao for his assistance with it. We thank Mr. A. Fawzi for the source code of LAST and all the help with its details. We also thank Dr. G. von Borries for fruitful cooperation and discussions.

References

Bengio, Y., Courville, A., Vincent, P., 2013. Representation learning: a review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence 35, 1798–1828.
Boyd, S.P., Vandenberghe, L., 2004. Convex Optimization. Cambridge University Press.
Dobbin, K.K., Simon, R.M., 2011. Optimally splitting cases for training and testing high dimensional classifiers. BMC Medical Genomics 4, 31.
Donoho, D.L., Huo, X., 2001. Uncertainty principles and ideal atomic decomposition. IEEE Transactions on Information Theory 47, 2845–2862.
Donoho, D.L., Johnstone, I.M., 1994. Ideal spatial adaptation by wavelet shrinkage. Biometrika 81, 425–455.
Du, P., Weber, R., Luszczek, P., Tomov, S., Peterson, G., Dongarra, J., 2012. From CUDA to OpenCL: Towards a performance-portable solution for multi-platform GPU programming. Parallel Computing 38, 391–407.
Fawzi, A., Davies, M., Frossard, P., 2014. Dictionary learning for fast classification based on soft-thresholding. International Journal of Computer Vision, 1–16.
Huang, G.B., Zhu, Q.Y., Siew, C.K., 2006. Extreme learning machine: theory and applications. Neurocomputing 70, 489–501.
Krizhevsky, A., 2009. Learning multiple layers of features from tiny images. Computer Science Department, University of Toronto.
LeCun, Y., Bottou, L., Bengio, Y., Haffner, P., 1998. Gradient-based learning applied to document recognition. Proceedings of the IEEE 86, 2278–2324.
Mairal, J., Bach, F., Ponce, J., 2012. Task-driven dictionary learning. IEEE Transactions on Pattern Analysis and Machine Intelligence 34, 791–804.
Pfahringer, B., 1995. Compression-based discretization of continuous attributes, in: Proc. 12th International Conference on Machine Learning, pp. 456–463.
Ravishankar, S., Bresler, Y., 2013. Learning sparsifying transforms. IEEE Transactions on Signal Processing 61, 1072–1086.
Schmidhuber, J., 2015. Deep learning in neural networks: an overview. Neural Networks 61, 85–117.
Shekhar, S., Patel, V.M., Chellappa, R., 2014. Analysis sparse coding models for image-based classification, in: IEEE International Conference on Image Processing.
Valkealahti, K., Oja, E., 1998. Reduced multidimensional co-occurrence histograms in texture classification. IEEE Transactions on Pattern Analysis and Machine Intelligence 20, 90–94.